<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Thought-provoking Question Matrix to Guide the Development of Foundation-Model-based Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sietske Tacoma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jimmy Mulder</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthieu Laneuville</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Leijnen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SURF</institution>
          ,
          <addr-line>Moreelsepark 48, 3511 EP, Utrecht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Utrecht University of Applied Sciences</institution>
          ,
          <addr-line>Heidelberglaan 15, 3584 CS, Utrecht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Organizations feel an urgency to develop and implement applications based on foundation models: AI-models that have been trained on large-scale general data and can be finetuned to domain-specific tasks. In this process, organizations face many questions regarding model training and deployment, but also concerning added business value, implementation risks and governance. They express a need for guidance to answer these questions in a suitable and responsible way. We aim to offer such guidance through the question matrix presented in this paper. The question matrix is adapted from the model card to match the development of AI-applications rather than AI-models. First pilots with the question matrix revealed that it elicited discussions among developers and helped developers explicate their choices and intentions during development.</p>
      </abstract>
      <kwd-group>
        <kwd>foundation models</kwd>
        <kwd>use cases</kwd>
        <kwd>model cards</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the recent advent of foundation models, defined as general purpose AI-models that
have been trained on large-scale data, organizations are more eager than ever to develop
AI-powered applications. Foundation models have quickly built a reputation as powerful
building blocks for domain-specific applications, by diminishing the need to explicate the
logic needed for such applications [1]. They perform well on numerous general tasks such
as text and image generation, speech recognition and graph creation [2]. Furthermore, with
only limited further training, they can quickly outperform more traditional AI-models on a
wide variety of domain-specific tasks. It is no wonder that organizations see the potential
of foundation models and feel an urgency to explore use cases in which these models
can add value to their organization.</p>
      <p>Developing applications based on foundation models also raises many questions and
challenges for organizations. These include short-term questions, such as whether to use
available foundation models or to train a model of their own from scratch, whether to use an
existing model as is or to finetune it with their own data, and, in the latter case, which data to use.
Evaluating performance is also a challenge, as the capabilities of a foundation model that
have been demonstrated on benchmarks may be quite distant from the capabilities required
in the organization’s use case. Long-term strategic topics, such as added business value,
risks and governance, are also a concern [3]. Added value can be conceptualized financially,
in terms of return on investment, but also more generally, in terms of, for example, the
efficiency, effectiveness and job satisfaction of the people using the applications. Regarding
risks and governance, organizations have concerns about dependency on models
provided by Big Tech companies such as Microsoft (OpenAI), Google (DeepMind), and
Amazon (Anthropic), about the transparency of and possible bias in these models, and about
the transfer of intellectual property, especially when prompting or finetuning these models
with their own data.</p>
      <p>Organizations are looking for guidance in addressing these questions and concerns. More
specifically, once a use case has been identified and the decision has been made to start
developing an application based on a foundation model, organizations are looking for ways
to make responsible choices in this process. Many of these choices involve considering
several options and weighing several perspectives (e.g., performance, financial and ethical
aspects). In this paper we present a question matrix to guide reflection on these choices
from different perspectives. By using this question matrix repeatedly during application
development, developers are encouraged to explicate the options and considerations they
have and to track the development of their thinking over time. This has the potential to
foster 1) more deliberate choices in the designed application, in terms of both the
perspectives considered and the short-term and long-term benefits; 2)
transparency about the design of the application; and 3) traceability, which enables reuse of
datasets, models, and other components in designing other, similar applications within the
organization.</p>
      <p>In this paper, we describe the design of and first experiences with this question matrix.
We have used model cards [4] as a basis for the question matrix, as further elaborated in
section 2. How we have transformed the model card structure into the question matrix is
described in section 3. Section 4 gives an overview of our first experiences with the question
matrix. In section 5 we present our conclusions and directions for further research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review: documentation approaches as the basis</title>
      <p>When releasing AI-models, it is common practice to provide accompanying documentation
about the model’s architecture, the (type of) data it was trained on and evaluated with, and its
intended use. Such documentation fosters transparency of AI-models and serves as a basis
for assessing compliance with legal requirements [5]. Documenting the
characteristics of the released model requires explicating and motivating the choices that
have been made during development. Hence, such documentation approaches foster
reflection on these choices, and they could therefore serve as a solid
starting point for designing an instrument that facilitates making these choices in a responsible
way.</p>
      <p>Most documentation approaches that have been proposed focus on data and AI models,
rather than AI-systems or AI-based applications. Therefore, we chose to base our
instrument on a seminal approach for documenting AI models, the model card [4]. The
model card approach was proposed as a framework to report on model performance
characteristics and to clarify which use cases the released machine learning model is and is
not intended for. An appealing characteristic of the model card is that it asks for a
description of contextual factors: the variety in groups, instrumentation, and environmental
factors that the model has been evaluated on. Addressing and explicating this variety can
spur reflection on inclusion and diversity during development.</p>
      <p>The model card is an example of an information sheet: a structured collection and
presentation of information on different technical and non-technical aspects. Micheli and
colleagues have identified three other main categories of documentation approaches:
questionnaires, composable widgets and checklists. For the purpose of guiding
development, and prompting discussion and reflection, questionnaires and checklists are
generally more appropriate than information sheets [6]. Questionnaires in particular provide
more in-depth coverage and hence encourage thorough reflection on the use and potential
misuse of the AI-model or system under consideration [5].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Development of the question matrix</title>
      <p>As argued above, the model card structure provided a solid basis for an instrument to guide
the development of AI-applications based on foundation models. This basis had to be
expanded for two reasons. First, to suit AI-powered applications rather than AI models only,
additional categories were needed to address the deployment and implementation of such
applications. Second, to adjust the instrument for the purpose of providing guidance during
development, rather than post-development documentation only, we reshaped it into a
question matrix instead of an information sheet. In the next two subsections, we elaborate
on these two adjustments.</p>
      <sec id="sec-3-1">
        <title>3.1. Additional categories for AI-powered applications</title>
        <p>The model card structure consists of nine categories: Model details, Intended use, Factors,
Metrics, Evaluation data, Training data, Quantitative analyses, Ethical considerations, and
Caveats and Recommendations. Except for model details such as model date and version, all
these categories are relevant for the purpose of providing guidance during application
development. Inspiration for additional categories to address deployment and
implementation of AI-applications was drawn from two dominant frameworks for
AI deployment and integration: CRISP-DM [7, 8] and ML-Ops [9].</p>
        <p>The CRISP-DM cycle consists of six phases: Business Understanding, Data Understanding,
Data Preparation, Modeling, Evaluation and Deployment. While Data Preparation and
Modeling were judged to be fully addressed by the model card structure, for all other phases
additional items were needed. For Business Understanding, additional items concerned the
use case, and more specifically the aim of developing the application, specific tasks of the
application, and the context in which the application was to be used. Furthermore, an item
was added on the intended role of the application in the users’ daily working processes. For
Data Understanding we decided to add an item on data quality. For the Evaluation phase,
we added a more general evaluation item besides the technical metrics for model
performance, to evaluate whether the application is indeed appropriate for the task it was
intended for. Finally, Deployment was not yet addressed in the model card, so items
regarding maintenance and the embedding in the organization’s software systems were
added.</p>
        <p>From the ML-Ops perspective, two additional themes were identified: future
monitoring of model performance and addition of new data. Therefore, items addressing
future monitoring and training new model versions were added.</p>
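<p>These two themes can be illustrated with a minimal monitoring sketch. This is our own illustrative code, not part of the instrument: the function name, the example scores and the tolerated drop are assumptions.</p>

```python
# Illustrative sketch of the ML-Ops monitoring theme: periodically compare a
# performance metric measured on fresh data against the value measured at
# deployment, and flag when training a new model version may be needed.
def needs_retraining(baseline_score: float,
                     recent_scores: list[float],
                     tolerated_drop: float = 0.05) -> bool:
    """Return True when the average recent score falls more than
    `tolerated_drop` below the score measured at deployment time."""
    if not recent_scores:
        return False  # no fresh evaluations yet, nothing to compare
    recent_avg = sum(recent_scores) / len(recent_scores)
    return (baseline_score - recent_avg) > tolerated_drop

# Deployed at a score of 0.90; recent evaluations around 0.82 signal drift.
print(needs_retraining(0.90, [0.83, 0.81, 0.82]))  # True
print(needs_retraining(0.90, [0.89, 0.88]))        # False
```

<p>In practice the metric, the evaluation data and the threshold would follow from the answers to the Metrics and Maintenance questions in the matrix.</p>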
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Shaping the instrument into a question matrix</title>
        <p>The model card consists of a list of items, divided into several categories. To prompt
discussion and reflection, we reshaped the items into questions. Furthermore, we included
multiple columns, thus shaping the instrument as a question matrix rather than as a
questionnaire. The matrix consists of five categories: 1) Intended use, 2) Model properties,
3) Training, model performance and application performance, 4) Scope of the application
(contextual factors), and 5) Implementation, maintenance and development. The first
column resembles the model card: by answering the questions, developers give an overview
of the current status of the AI-application under development. The second column asks
developers to motivate the choices that have been made and to specify the considerations
that led to them. The third column asks developers for the alternatives that they are considering or
have considered during development.</p>
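<p>As an illustration, one row of such a matrix could be represented in code as follows. This is a hypothetical sketch: the class, field names and example answers are ours, not part of the instrument; only the five category names and the three columns come from the text above.</p>

```python
from dataclasses import dataclass

# Hypothetical sketch of one row of the question matrix.
# The three string fields mirror the three columns described in the paper.
@dataclass
class MatrixEntry:
    category: str          # one of the five categories below
    question: str
    current_status: str    # column 1: current status of the application
    motivation: str        # column 2: choices made and the considerations behind them
    alternatives: str      # column 3: alternatives (being) considered

# The five categories of the question matrix, as listed in the text.
CATEGORIES = [
    "Intended use",
    "Model properties",
    "Training, model performance and application performance",
    "Scope of the application (contextual factors)",
    "Implementation, maintenance and development",
]

# Example row with made-up answers, to show how the columns relate.
entry = MatrixEntry(
    category=CATEGORIES[2],
    question="Which metrics does your organization use to evaluate the model?",
    current_status="F1 score on a held-out sample of our own data",
    motivation="Class imbalance in the use case makes accuracy misleading",
    alternatives="Accuracy; human evaluation of a sample of outputs",
)
```

<p>Filling in such rows repeatedly during development would make changes in status, motivation and alternatives traceable over time.</p>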
        <p>The obtained question matrix was presented to two experts in the field of AI. They
suggested that addressing the internal organization, especially the stakeholders who are to make
decisions regarding implementation, would be useful, as these factors could also influence
the choices that developers make. Adding these questions resulted in the final question
matrix, of which all questions are presented in Appendix A.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. First experiences with the question matrix</title>
      <p>The question matrix was first piloted in three Dutch media organizations. In each of these
organizations a foundation-model-based application was developed. During the
development, in each organization the first author conducted a one-hour interview with an
involved AI-developer, using the question matrix as interview guide. In one project, a
foundation model was finetuned to be adapted for a specific purpose. The other two projects
concentrated on using foundation models as offered and evaluating their performance on
the organization’s data for specific purposes.</p>
      <p>In all interviews approximately half an hour was needed for specifying the intended use
and the datasets needed for using or finetuning the foundation models. For these topics, the
interviewees generally knew which alternatives had been considered and how choices had
been made. They also were clear about their choices for the foundation models that had
been selected for experimentation and development.</p>
      <p>Concerning evaluation metrics, added value, scope, implementation, maintenance and
development, their answers were less clear and complete. By analyzing the interview
transcripts, three types of less concrete answers were identified. First, interviewees seemed
to explicate ideas for the first time during the interview. For example, interviewees used
phrases such as “Now that I think of it” and “We didn’t mention it explicitly, but I think so.”
In multiple cases, this happened for the questions concerning what was in scope and
out of scope for the application. Interviewees did not seem to have addressed this in their
discussions with colleagues, but did appear to have implicit ideas about what was beyond
the scope of their application, which they explicated during the interviews. Second,
interviewees identified topics that had not been addressed yet in development and needed
attention. This was especially the case for more technical topics such as the use of specific
evaluation metrics and the way in which cross-validation could or should be used in the
finetuning procedure. An interviewee pondered that “maybe these are questions that we
should take into the organization”, expressing a realization that more attention for these
topics was needed and fellow developers and other stakeholders within the organization
should be involved. Third, interviewees started developing new ideas during the interview.
This especially happened in an interview with two interviewees, where answers by one
interviewee seemed to ignite new ideas in the other. This shows that using the question
matrix in development teams may help teams explicate ideas, develop a shared
understanding of these ideas, and build on each other’s ideas.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future research directions</title>
      <p>In this paper we presented a question matrix that is aimed at helping developers explicate
their options and the consequences of their choices repeatedly during development of
foundation-model-based applications. The question matrix is based on seminal approaches
for documenting AI-models, and adjusted to apply to AI-applications by drawing from
literature on CRISP-DM and ML-Ops. First experiences with the question matrix show that
it indeed seems to encourage discussion and reflection during development. To exploit this
potential, we envision that developers fill in the question matrix repeatedly during the
development and deployment of a foundation-model-based application, for example at the
beginning and halfway through the development project, towards the deployment phase
and repeatedly during deployment.</p>
      <p>We conjecture that filling in the question matrix also serves well as a documentation
approach, especially within organizations. It fosters transparency of these applications and
could enable easier reuse of data, (foundation) models and architectures for other purposes
within the organization. Further research is needed to address this potential.</p>
      <p>Another direction for future research is the completeness of this question matrix.
Organizations express a desire that an instrument like this may help them avert or mitigate
future risks, such as dependence on Big Tech companies and bias caused by foundation
models. Using, for instance, separate ethics checklists may feel like an extra burden.
Therefore, in the question matrix we have aimed to address AI-application development
from multiple perspectives and throughout its lifecycle, to obtain a sense of completeness.
Future research is needed to further develop and assess this completeness, for example by
aligning the instrument with the practice of regulatory oversight, as will be required by the
AI Act. As regulatory oversight may differ between sectors, this may lead to tailored
question matrices for different sectors. Hence, evaluation of the question matrix and its
completeness in various sectors is also a promising avenue towards more responsible
implementation of foundation-model-based applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix A. The questions in the question matrix</title>
      <sec id="sec-7-1">
        <title>Training data</title>
        <p>Which selection criteria for including data in the training dataset
are used?</p>
        <sec id="sec-7-1-1">
          <title>Developers</title>
          <p>Who is developing the AI-application?
Which parts of the application are developed within your organization
and which parts are developed by other parties?</p>
        </sec>
        <sec id="sec-7-1-2">
          <title>Training, model performance and application performance</title>
        </sec>
        <sec id="sec-7-1-3">
          <title>Metrics</title>
          <p>Which metrics does your organization use to evaluate the model?
To what extent do you identify and monitor metrics for various groups
or categories (also see Scope of the application)?
If applicable: which decision thresholds are being used?
How much variation is present in the values of the evaluation
metrics?
How do you evaluate whether the application is indeed appropriate for
the tasks you have identified for it?</p>
        </sec>
        <sec id="sec-7-1-4">
          <title>Training procedure - only needed to answer when finetuning or adjusting an existing model, not when using an existing model as is within the application</title>
          <p>What does the training procedure look like?
(In which way) is cross-validation used?
Do you combine results of multiple runs?</p>
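<p>As an illustration of how these questions might be answered, the following sketch combines k-fold cross-validation with multiple runs. The function names and the procedure are our own assumptions, not prescribed by the question matrix.</p>

```python
# Illustrative sketch: k-fold cross-validation for a finetuning experiment,
# averaging an evaluation score over folds and over multiple runs (seeds).
def k_fold_indices(n_examples: int, k: int):
    """Yield (train_indices, val_indices) pairs for k roughly equal folds."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for i in range(k):
        stop = (i + 1) * fold_size if i < k - 1 else n_examples
        val = indices[i * fold_size:stop]
        train = [j for j in indices if j not in set(val)]
        yield train, val

def cross_validated_score(evaluate, n_examples: int, k: int = 5, runs: int = 3):
    """Average evaluate(train, val, seed) over all folds and all runs,
    combining the results of multiple runs into a single score."""
    scores = [evaluate(train, val, seed)
              for seed in range(runs)
              for train, val in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)
```

<p>Here <code>evaluate</code> stands for any user-supplied routine that finetunes on the training indices and scores on the validation indices.</p>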
        </sec>
        <sec id="sec-7-1-5">
          <title>Evaluation data</title>
          <p>Which dataset(s) is/are used to evaluate the model or application?
How is the evaluation dataset (and its annotation) created?
How is the evaluation data preprocessed?
How do you make sure the evaluation dataset is appropriate for
evaluation (taking into account contextual factors and
representativity)?</p>
        </sec>
        <sec id="sec-7-1-6">
          <title>Scope of the application (contextual factors)</title>
        </sec>
        <sec id="sec-7-1-7">
          <title>Groups</title>
          <p>For which different groups (e.g. cultural, demographic, phenotypic)
should the application perform?
How are these different groups taken into account in training data,
training procedure and evaluation?</p>
        </sec>
        <sec id="sec-7-1-8">
          <title>Instrumentation</title>
          <p>For which variation in instrumentation (e.g. image quality, sound
quality) should the application perform?
How are these different instrumentations taken into account in training
data, training procedure and evaluation?</p>
        </sec>
        <sec id="sec-7-1-9">
          <title>Environment</title>
          <p>For which variation in environmental factors (e.g. light, weather
conditions) should the application perform?
How are these different environmental factors taken into account in
training data, training procedure and evaluation?</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>Implementation, maintenance and development</title>
        <sec id="sec-7-2-2">
          <title>Implementation</title>
          <p>How is the application supposed to be implemented within the
organization?
Who decides on actual implementation?</p>
        </sec>
        <sec id="sec-7-2-3">
          <title>Maintenance</title>
          <p>How is the application supposed to be maintained?
Who is involved in maintaining the application?
How will it be monitored whether the model keeps performing as
intended and whether model drift or model shift occurs?
What is your plan for identifying and mitigating risks?</p>
        </sec>
        <sec id="sec-7-2-4">
          <title>Development</title>
          <p>How, and how often, is a new version of the model trained?
How do you handle newly available data?
Who is involved in further development of the application?</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” 2021, doi: 10.48550/arxiv.2108.07258.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>C. Zhou et al., “A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT,” Feb. 2023, Accessed: Mar. 21, 2024. [Online]. Available: http://arxiv.org/abs/2302.09419</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>S. Leijnen, H. Aldewereld, R. van Belkom, R. Bijvank, and R. Ossewaarde, “An agile framework for trustworthy AI,” NeHuAI@ECAI, pp. 75-78, 2020, Accessed: Apr. 15, 2024. [Online]. Available: https://www.academia.edu/download/76467298/leijnen.pdf</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>M. Mitchell et al., “Model cards for model reporting,” FAT* 2019 - Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency, pp. 220-229, Jan. 2019, doi: 10.1145/3287560.3287596.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>M. Micheli et al., Ethics and Information Technology, vol. 25, no. 4, Dec. 2023, doi: 10.1007/s10676-023-09725-7.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>M. A. Madaio, L. Stark, J. Wortman Vaughan, and H. Wallach, “Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI,” in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, New York, NY, USA: ACM, Apr. 2020, pp. 1-14, doi: 10.1145/3313831.3376445.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>C. Schröer, F. Kruse, and J. M. Gómez, “A systematic literature review on applying CRISP-DM process model,” Procedia Comput Sci, vol. 181, pp. 526-534, 2021, doi: 10.1016/j.procs.2021.01.199.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>P. Chapman et al., “CRISP-DM 1.0 Step-by-step data mining guide,” 2000, Accessed: Mar. 22, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:59777418</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture,” IEEE Access, vol. 11, pp. 31866-31879, 2023, doi: 10.1109/ACCESS.2023.3262138.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>