<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Model with Emergent Communication Framework for Explainable AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Farnoosh Javar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kei Wakabayashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tsukuba</institution>
          ,
          <addr-line>Ibaraki</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Interpretable machine learning seeks to enhance transparency in model decision-making, particularly in high-stakes applications. Concept bottleneck models (CBMs) improve interpretability by using human-defined concepts as intermediate representations. Yet, they often depend on extensive manual annotations and may fail to capture relevant features beyond predefined concepts. We propose an iterative communication emergence framework for interpretable machine learning that integrates a concept bottleneck model with data-driven discovery of latent features. Our approach employs a sender-receiver architecture, where the sender encodes raw inputs into discrete latent signals refined via reinforcement learning, and the receiver uses these latent concepts to predict outcomes. Latent representations are aligned post hoc with human-observable concepts, which are automatically generated by a language model and validated statistically, enabling transparent explanations while reducing reliance on manual annotations. Experiments on a cat breed classification task demonstrate that our framework maintains high predictive performance while progressively refining interpretable concept representations. Results suggest that emergent latent concepts can meaningfully align with human-understandable attributes, facilitating more flexible and scalable interpretability in deep learning models.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept bottleneck models</kwd>
        <kwd>Emergent communication</kwd>
        <kwd>Multi-agent communication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Interpretable machine learning is essential for high-stakes decision-making, where understanding model
behavior is critical. Rather than relying on post-hoc explanations of black-box models, researchers
advocate for inherently interpretable models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Large language models (LLMs) increase interpretability
challenges, particularly in specialized domains such as health care and justice, where transparency
and domain expertise are crucial. Their opaque reasoning process undermines trust in AI-driven
decision-making, necessitating frameworks that enhance both accuracy and interpretability [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        Concept bottleneck models provide a structured approach to interpretability by introducing an
intermediate layer of human-defined concepts, enabling direct user intervention. These models achieve
competitive performance while enhancing transparency by allowing users to modify concept values.
However, CBMs are constrained by their reliance on manual concept annotations, which can be costly or
infeasible in specific domains. Additionally, they often underperform compared to unconstrained models,
limiting their practical adoption [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. CBMs also limit model flexibility by enforcing predictions based
solely on predefined concepts, restricting the model’s ability to capture latent concepts that may exist in
the data but are not explicitly defined. Additionally, soft concept predictions can introduce information
leakage, where intermediate representations inadvertently encode task-specific information beyond
the intended concepts, reducing interpretability. Balancing predictive performance and transparency
continues to pose a central challenge in the field of interpretable AI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To address these limitations, post-hoc concept bottleneck models (PCBMs) have been proposed to
transform pre-trained neural networks into CBMs without sacrificing performance. PCBMs facilitate
external concept integration and enable efficient model editing to mitigate dataset biases [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Another
promising approach involves automatically discovering latent concepts from model representations,
reducing reliance on predefined annotations while maintaining interpretability and narrowing the
accuracy gap between interpretable and black-box models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Late-breaking work, Demos and Doctoral Consortium, colocated with the 3rd World Conference on eXplainable Artificial Intelligence (CEUR Workshop Proceedings, ceur-ws.org).</p>
      <p>Figure 1 (caption fragment): [...] and a reinforcement learning loop iteratively refines the concept representations.</p>
      <p>
        Recent advancements leverage LLMs to generate human-readable concepts automatically. For
example, the language in a bottle (LaBo) framework utilizes GPT-3 to replace manual concept annotations
with natural language descriptions. However, this approach introduces challenges such as assessing the
validity of generated concepts and mitigating information leakage [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Additionally, recent research in
emergent multi-agent communication investigates how AI agents develop internal languages to
coordinate and solve tasks collaboratively. These emergent languages can encode task-relevant abstractions
that are not explicitly programmed, potentially revealing novel representations. However, aligning
these representations with human semantics remains an open challenge, as the symbols and structures
developed by AI agents may not directly correspond to human-understandable concepts, requiring
further interpretability methods [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this work, we propose a discrete emergent language framework
that integrates aspects of concept bottlenecks with data-driven discovery of latent features. Instead of
relying solely on a fixed set of human-defined concepts, our approach enables a sender–receiver pair of
modules to develop a symbolic communication system through iterative training, subject to a discrete
bottleneck. The language is considered emergent because the meanings of these symbols are not
explicitly predefined by humans; instead, they develop as the system optimizes sender–receiver interactions
for the given task. By constraining communication to a discrete channel, the model’s internal reasoning
is structured around human-interpretable latent concepts when alignment with known human concepts
occurs. When a learned latent concept does not correspond to any predefined human concept, it is
retained as a potentially novel feature. These latent factors remain part of the communication protocol
and can later be analyzed by domain experts to assess their relevance or meaning. Our framework
seeks to balance predictive performance with interpretability, allowing models to capture meaningful
latent patterns while mitigating the limitations of predefined concept bottlenecks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach</title>
      <p>
        We propose an iterative language emergence framework within a concept bottleneck model pipeline to
achieve interpretable and scalable AI. The framework consists of two cooperative agents—a sender and a
receiver—that communicate through discrete signals of latent concepts. The sender converts raw inputs
into binary latent concept vectors, while the receiver, implemented as a simple linear classifier, maps
these vectors directly to predicted labels. A reinforcement learning-based feedback loop using proximal
policy optimization (PPO) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] iteratively refines these representations to improve both accuracy and
interpretability (see Fig. 1).
      </p>
      <sec id="sec-2-1">
        <title>2.1. Problem Definition</title>
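        <p>We address an image classification task with training dataset <inline-formula><tex-math>\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}</tex-math></inline-formula>, where <inline-formula><tex-math>x^{(i)}</tex-math></inline-formula> is an image, <inline-formula><tex-math>y^{(i)}</tex-math></inline-formula> is
the associated class label, and <inline-formula><tex-math>N</tex-math></inline-formula> denotes the training dataset size. In addition, we have access to a set of
<inline-formula><tex-math>M</tex-math></inline-formula> human-observable concepts (HOCs) that can provide interpretable annotations (e.g., “slender body” or
“no visible whiskers”). Let <inline-formula><tex-math>C = \{c_1, \ldots, c_M\}</tex-math></inline-formula> be the set of natural language descriptions of the HOCs, and let
<inline-formula><tex-math>\mathrm{HOC}^{(i)} = \{\mathrm{HOC}_1^{(i)}, \ldots, \mathrm{HOC}_M^{(i)}\}</tex-math></inline-formula> be binary values, where <inline-formula><tex-math>\mathrm{HOC}_j^{(i)}</tex-math></inline-formula> indicates the presence or absence of the <inline-formula><tex-math>j</tex-math></inline-formula>-th HOC in
the <inline-formula><tex-math>i</tex-math></inline-formula>-th image. We assume that LLMs can be leveraged to automatically generate candidate HOCs <inline-formula><tex-math>C</tex-math></inline-formula> and
<inline-formula><tex-math>\mathrm{HOC}^{(i)}</tex-math></inline-formula> from class descriptions, domain-specific glossaries, or other textual corpora by prompting with
questions such as “What are the visual features of <inline-formula><tex-math>y</tex-math></inline-formula>?” (where <inline-formula><tex-math>y</tex-math></inline-formula> is the class label) and “Does the image
<inline-formula><tex-math>x^{(i)}</tex-math></inline-formula> have the visual feature <inline-formula><tex-math>c_j</tex-math></inline-formula>?”, respectively. In contrast to traditional CBMs, this assumption does
not require concept generation by human domain experts. We emphasize, however, that the available
HOCs are not necessarily effective features for the classification task.</p>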
        <p>
          Our goal is to develop a model that classifies images accurately and offers transparent reasoning based
on a combination of discrete interpretable concepts. The discrete binary representation of HOCs (i.e.,
hard concepts) enhances interpretability and prevents information leakage [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Effectiveness is evaluated
along two axes: classification accuracy and the extent to which human supervision—particularly in the
form of HOC annotations—is required at test time. Reducing this annotation burden, even when HOCs
are initially bootstrapped via LLMs, is desirable for scaling interpretable AI systems to new instances.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sender–Receiver Framework and Emergent Communication Protocol</title>
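        <p>The sender is a feedforward neural network (FNN) <inline-formula><tex-math>f_\theta</tex-math></inline-formula> parameterized by <inline-formula><tex-math>\theta</tex-math></inline-formula>, which processes raw input
features <inline-formula><tex-math>x^{(i)}</tex-math></inline-formula> (e.g., an image of a cat) and outputs a vector of <inline-formula><tex-math>K</tex-math></inline-formula> discrete latent concepts, <inline-formula><tex-math>z^{(i)} = (z_1^{(i)}, \ldots, z_K^{(i)})</tex-math></inline-formula>.</p>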
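        <p>
          These latent concepts represent task-relevant patterns that emerge from data. To enforce discreteness,
the continuous outputs of the FNN are thresholded using a predefined threshold <inline-formula><tex-math>\tau</tex-math></inline-formula>:
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math><![CDATA[z_k^{(i)} = \begin{cases} 1, & \text{if } f_\theta(z_k^{(i)} \mid x^{(i)}) > \tau, \\ 0, & \text{otherwise,} \end{cases}]]></tex-math>
          </disp-formula>
where <inline-formula><tex-math>f_\theta(z_k^{(i)} \mid x^{(i)})</tex-math></inline-formula> is the predicted probability that concept <inline-formula><tex-math>z_k^{(i)}</tex-math></inline-formula> is active for a given input <inline-formula><tex-math>x^{(i)}</tex-math></inline-formula>. This
binarization yields a vector (e.g., <inline-formula><tex-math>z^{(i)} = [1, 0, 0, 1, 1]</tex-math></inline-formula>), which serves as the message sent to the receiver.
        </p>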
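        <p>The thresholding rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name binarize and the example probabilities are hypothetical, and the default threshold of 0.5 is the value reported in Section 4.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List


def binarize(probabilities: List[float], tau: float = 0.5) -> List[int]:
    """Threshold per-concept activation probabilities into a binary message.

    A concept is active (1) only when its probability is strictly greater
    than tau, matching the strict inequality in the thresholding rule.
    """
    return [1 if p > tau else 0 for p in probabilities]


# Hypothetical sender outputs for five latent concepts.
sender_probs = [0.91, 0.12, 0.50, 0.77, 0.63]
message = binarize(sender_probs, tau=0.5)
print(message)  # [1, 0, 0, 1, 1]
```
]]></preformat>
        <p>Because the comparison is strict, a probability exactly equal to the threshold leaves the concept inactive.</p>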
        <p>The set of discrete latent concept symbols forms a communication protocol between the agents. Each
<inline-formula><tex-math>z_k</tex-math></inline-formula> can be seen as a “word” in the emergent communication that the sender uses to describe the input’s
relevant properties.</p>
        <p>The receiver agent <inline-formula><tex-math>g_\phi</tex-math></inline-formula>, parameterized by <inline-formula><tex-math>\phi</tex-math></inline-formula>, is a simple linear classifier that takes the sender’s message
<inline-formula><tex-math>z^{(i)}</tex-math></inline-formula> as input and outputs a prediction <inline-formula><tex-math>\hat{y}^{(i)}</tex-math></inline-formula> of the final class label (e.g., the breed of a cat, such as “Persian”).
The receiver is essentially the label predictor in the CBM pipeline: it maps from latent concepts to the
target output. Because we ensure that any information used by the receiver to make the prediction must
pass through the bottleneck of discrete concepts (intended to be human-understandable), its decision
process is more transparent than a direct end-to-end model.</p>
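        <p>We optimize the sender’s policy to communicate useful concepts using reinforcement learning (RL).
The sender is updated with a policy gradient method, specifically proximal policy optimization (PPO),
to maximize the expected reward. The reward for the sender’s action (i.e., message <inline-formula><tex-math>z^{(i)}</tex-math></inline-formula>) is computed
after the receiver predicts a label <inline-formula><tex-math>\hat{y}^{(i)}</tex-math></inline-formula> as follows:
          <disp-formula id="eq-2">
            <label>(2)</label>
            <tex-math><![CDATA[r = \begin{cases} +\alpha \cdot \max(\hat{y}), & \text{if } \arg\max(\hat{y}) = y, \\ -\beta \cdot \max(\hat{y}), & \text{otherwise,} \end{cases}]]></tex-math>
          </disp-formula>
where <inline-formula><tex-math>\max(\hat{y})</tex-math></inline-formula> represents the maximum logit for the predicted class, <inline-formula><tex-math>\alpha</tex-math></inline-formula> is a positive scaling factor for
correct predictions, and <inline-formula><tex-math>\beta</tex-math></inline-formula> controls the penalty for incorrect classifications. At each training iteration,
we sample a batch of inputs, let the sender produce messages <inline-formula><tex-math>z</tex-math></inline-formula>, and let the receiver predict labels <inline-formula><tex-math>\hat{y}</tex-math></inline-formula>.
Then, the receiver’s parameter <inline-formula><tex-math>\phi</tex-math></inline-formula> is trained via supervised learning using cross-entropy loss, and the
sender’s parameter <inline-formula><tex-math>\theta</tex-math></inline-formula> is updated by the PPO algorithm. This training process can be seen as the agents
jointly evolving a more effective shared communication protocol.</p>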
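        <p>The reward rule can be written as a small function. The sketch below is illustrative only: the function name sender_reward and the example logits are hypothetical, and the two scaling factors default to 1.0 as placeholders because this section does not report their values.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence


def sender_reward(logits: Sequence[float], true_label: int,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Return +alpha * max logit for a correct argmax, else -beta * max logit.

    alpha and beta are placeholder values (assumption), not the paper's settings.
    """
    max_logit = max(logits)
    predicted = max(range(len(logits)), key=lambda c: logits[c])
    if predicted == true_label:
        return alpha * max_logit
    return -beta * max_logit


# Hypothetical receiver logits over four classes.
logits = [0.2, 2.5, -1.0, 0.4]
print(sender_reward(logits, true_label=1))  # 2.5  (correct prediction)
print(sender_reward(logits, true_label=3))  # -2.5 (incorrect prediction)
```
]]></preformat>
        <p>Since a raw logit can be negative, the sign of this quantity depends on the logit scale; Section 4 lists maximum softmax probability and output entropy as alternative confidence measures.</p>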
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Aligning Latent Concepts with Human-Observable Concepts</title>
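        <p>While the latent concepts emerge autonomously during training, we align them with HOCs in a post
hoc fashion using statistical validation. The proposed algorithm consists of <inline-formula><tex-math>T_{\mathrm{iter}}</tex-math></inline-formula> iterative cycles, where
the <inline-formula><tex-math>t</tex-math></inline-formula>-th cycle consists of three steps: (1) training the receiver for <inline-formula><tex-math>T_{\mathrm{epoch}}</tex-math></inline-formula> epochs using supervised learning and
updating the sender via PPO during each epoch, (2) choosing the best parameter pair <inline-formula><tex-math>(\theta^{(t)}, \phi^{(t)})</tex-math></inline-formula> among
the <inline-formula><tex-math>T_{\mathrm{epoch}}</tex-math></inline-formula> epochs based on validation accuracy, and (3) establishing a new alignment mapping from a latent
concept to a HOC.</p>
        <p>In step (3), the proposed method generates the latent concept vector <inline-formula><tex-math>z^{(i)}</tex-math></inline-formula> from the sender <inline-formula><tex-math>f_{\theta^{(t)}}</tex-math></inline-formula> for
each image <inline-formula><tex-math>x^{(i)}</tex-math></inline-formula> in the training set. For each pair of latent concept <inline-formula><tex-math>z_k</tex-math></inline-formula> and HOC dimension <inline-formula><tex-math>\mathrm{HOC}_j</tex-math></inline-formula>, we
construct a contingency table of co-occurrence statistics between the presence vectors <inline-formula><tex-math>\{z_k^{(1)}, \ldots, z_k^{(N)}\}</tex-math></inline-formula>
and <inline-formula><tex-math>\{\mathrm{HOC}_j^{(1)}, \ldots, \mathrm{HOC}_j^{(N)}\}</tex-math></inline-formula> and apply Fisher’s exact test to calculate the <inline-formula><tex-math>p</tex-math></inline-formula>-value of the dependence. We
take the pair <inline-formula><tex-math>(k, j)</tex-math></inline-formula> such that the <inline-formula><tex-math>p</tex-math></inline-formula>-value is lowest and establish a mapping from the latent concept <inline-formula><tex-math>z_k</tex-math></inline-formula> to
<inline-formula><tex-math>\mathrm{HOC}_j</tex-math></inline-formula> if it indicates statistical significance (i.e., <inline-formula><tex-math>p &lt; 0.1</tex-math></inline-formula>).</p>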
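        <p>The alignment step can be sketched as follows. This is a self-contained illustration under stated assumptions, not the authors' code: the function names, the toy data, and the two-sided form of Fisher's exact test are choices made for this example, and indices are zero-based here whereas the text counts from one.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from math import comb
from typing import List, Optional, Tuple


def fisher_exact_p(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 table with rows (a, b) and (c, d)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def best_alignment(latents: List[List[int]], hocs: List[List[int]],
                   alpha: float = 0.1) -> Optional[Tuple[int, int, float]]:
    """Return (k, j, p) for the lowest-p latent/HOC pair, or None if p >= alpha."""
    best = None
    for k in range(len(latents[0])):
        for j in range(len(hocs[0])):
            pairs = [(z[k], h[j]) for z, h in zip(latents, hocs)]
            a = pairs.count((1, 1))
            b = pairs.count((1, 0))
            c = pairs.count((0, 1))
            d = pairs.count((0, 0))
            p = fisher_exact_p(a, b, c, d)
            if best is None or p < best[2]:
                best = (k, j, p)
    return best if best is not None and best[2] < alpha else None


# Toy data (hypothetical): 8 images, 2 latent concepts, 2 HOCs.
latents = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
hocs = [[0, 1], [1, 1], [0, 1], [0, 1], [1, 0], [1, 0], [0, 0], [0, 0]]
result = best_alignment(latents, hocs)
print(result)  # latent 0 pairs with HOC 1
```
]]></preformat>
        <p>In the toy data the first latent concept co-occurs perfectly with the second HOC, so that pair has the lowest p-value (2/70) and passes the 0.1 threshold.</p>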
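        <p>
          Once a mapping from <inline-formula><tex-math>z_k</tex-math></inline-formula> to <inline-formula><tex-math>\mathrm{HOC}_j</tex-math></inline-formula> is established, we substitute the value of <inline-formula><tex-math>z_k</tex-math></inline-formula> with the corresponding
HOC annotation <inline-formula><tex-math>\mathrm{HOC}_j</tex-math></inline-formula> in all subsequent processes without modifying the model’s internal computation
or predictions. For example, consider a scenario where the agents are trained to classify images of cats,
and <inline-formula><tex-math>z_2</tex-math></inline-formula> is aligned with <inline-formula><tex-math>\mathrm{HOC}_1</tex-math></inline-formula> “long whiskers” at the end of the first cycle. If the sender outputs
<inline-formula><tex-math>z^{(i)} = [1, 1, 0, 1, 1]</tex-math></inline-formula> for an image <inline-formula><tex-math>x^{(i)}</tex-math></inline-formula> of a cat without long whiskers, <inline-formula><tex-math>z_2^{(i)}</tex-math></inline-formula> is replaced with <inline-formula><tex-math>\mathrm{HOC}_1^{(i)} = 0</tex-math></inline-formula>, and
the receiver will receive the modified vector <inline-formula><tex-math>[1, 0, 0, 1, 1]</tex-math></inline-formula> during subsequent training and evaluation.
This mapping allows the model’s decisions to be interpreted in terms of known concepts when available,
or left as abstract latent factors when no significant alignment is found.
        </p>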
        <p>
          The alignment mapping from latent concepts to HOCs is cumulative; once a latent concept is aligned
to a HOC, that mapping is retained in all subsequent cycles. Suppose that <inline-formula><tex-math>z_5</tex-math></inline-formula> is aligned with <inline-formula><tex-math>\mathrm{HOC}_8</tex-math></inline-formula> “flat
nose” in the next cycle of the example above. If the sender outputs <inline-formula><tex-math>z^{(i)} = [1, 1, 0, 0, 1]</tex-math></inline-formula> for an image of a
cat with a flat nose and no long whiskers (<inline-formula><tex-math>\mathrm{HOC}_1 = 0</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{HOC}_8 = 1</tex-math></inline-formula>), the modified vector <inline-formula><tex-math>[1, 0, 0, 0, 1]</tex-math></inline-formula> will
be sent to the receiver. The receiver predicts the class label, now with a more interpretable explanation.
This refinement process accumulates aligned concepts while preserving predictive performance.
        </p>
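        <p>The two worked examples above can be reproduced with a short substitution routine. The sketch assumes a dictionary-based mapping from latent index to HOC index; the function name apply_alignment and the data layout are hypothetical, while the vectors and HOC values are those given in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List


def apply_alignment(message: List[int], hoc_values: Dict[int, int],
                    mapping: Dict[int, int]) -> List[int]:
    """Replace aligned latent concepts with their HOC annotations.

    message    : binary sender output, positions numbered from 1 as in the text
    hoc_values : HOC index -> binary annotation for this image
    mapping    : latent index -> HOC index, accumulated across cycles
    """
    modified = list(message)
    for latent_idx, hoc_idx in mapping.items():
        modified[latent_idx - 1] = hoc_values[hoc_idx]
    return modified


# End of the first cycle: latent concept 2 is aligned with HOC 1 ("long whiskers").
mapping = {2: 1}
first = apply_alignment([1, 1, 0, 1, 1], {1: 0}, mapping)
print(first)   # [1, 0, 0, 1, 1]

# Next cycle: latent concept 5 is additionally aligned with HOC 8 ("flat nose").
mapping[5] = 8
second = apply_alignment([1, 1, 0, 0, 1], {1: 0, 8: 1}, mapping)
print(second)  # [1, 0, 0, 0, 1]
```
]]></preformat>
        <p>Keeping a single mapping that only grows mirrors the cumulative behaviour described above: earlier alignments are never removed when a new one is added.</p>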
        <p>Over successive cycles, this procedure progressively refines both the predictive power and the
interpretability of the model. Latent concepts that prove useful for the task and show consistent alignment
with HOCs become more semantically meaningful, while unaligned concepts remain as candidate novel
concepts discovered by the model, which can be studied further by experts or potentially
added to the set of HOCs for future iterations. Importantly, the sender’s parameters persist
across iterations, enabling stable concept refinement, while the alignment mapping evolves to capture
emerging associations. This design allows the model to maintain high task accuracy while offering
increasingly transparent, concept-based explanations for its decisions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          We evaluate our framework on a subset of the Oxford-IIIT-Pet dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] containing four cat breeds:
Ragdoll, Persian, Sphynx, and Russian Blue. ResNet50-extracted features are used as input
representations. To avoid noise from LLM-generated concepts, we adopt a two-step verification: candidate
HOCs are first generated via generative AI and then manually annotated for reliability. This setup
enables robust alignment between latent concepts and human-observable attributes, facilitating rigorous
evaluation of our method. The final dataset comprises 400 images, each annotated with 26 binary HOCs
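(<inline-formula><tex-math>M = 26</tex-math></inline-formula>), and is divided into training (<inline-formula><tex-math>N = 320</tex-math></inline-formula>), validation, and test sets using an 80/10/10 stratified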
split.
        </p>
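        <p>An 80/10/10 stratified split of this kind can be sketched with the standard library alone. The function name stratified_split, the seed, and the assumption of 100 images per breed are illustrative; the text states only the total of 400 images and the training size of 320.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], fractions=(0.8, 0.1, 0.1),
                     seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Split example indices into train/val/test while preserving class ratios."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Assumption: 100 images per breed (only the total of 400 is stated in the text).
breeds = ["Ragdoll", "Persian", "Sphynx", "Russian Blue"]
labels = [breed for breed in breeds for _ in range(100)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 320 40 40
```
]]></preformat>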
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Configuration</title>
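        <p>Our experiments follow the iterative training procedure described in Section 2. The sender and receiver
are trained over five iterative cycles (<inline-formula><tex-math>T_{\mathrm{iter}} = 5</tex-math></inline-formula>), each consisting of 30 epochs (<inline-formula><tex-math>T_{\mathrm{epoch}} = 30</tex-math></inline-formula>), with a
batch size of 32. PPO settings include n_steps = 64, a batch size of 16, and a learning rate of <inline-formula><tex-math>3 \times 10^{-4}</tex-math></inline-formula>.
All other hyperparameters follow the default values in stable-baselines3 (v2.5.0).</p>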
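        <p>For reference, the reported settings can be collected in one configuration object. The class and field names below are hypothetical; only the numeric values come from the text. The keys returned by ppo_kwargs match keyword arguments of the PPO class in stable-baselines3, but the sketch does not import that library or construct a model.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters reported in the text; field names are illustrative."""
    n_cycles: int = 5              # iterative cycles
    epochs_per_cycle: int = 30     # epochs within each cycle
    batch_size: int = 32           # sender-receiver training batch size
    ppo_n_steps: int = 64          # PPO rollout length (n_steps)
    ppo_batch_size: int = 16       # PPO minibatch size
    ppo_learning_rate: float = 3e-4

    @property
    def total_epochs(self) -> int:
        """Total number of training epochs across all cycles."""
        return self.n_cycles * self.epochs_per_cycle

    def ppo_kwargs(self) -> dict:
        """Keyword arguments in the form accepted by stable-baselines3's PPO."""
        return {
            "n_steps": self.ppo_n_steps,
            "batch_size": self.ppo_batch_size,
            "learning_rate": self.ppo_learning_rate,
        }


config = TrainingConfig()
print(config.total_epochs)  # 150
print(config.ppo_kwargs())
```
]]></preformat>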
        <p>The alignment between latent concepts and HOCs is updated at the end of each iterative cycle
using Fisher’s exact test, applying a significance threshold of <inline-formula><tex-math>p &lt; 0.1</tex-math></inline-formula>. Model performance is evaluated
via classification accuracy and cross-entropy loss on both validation and test sets. We also provide
qualitative analysis by inspecting the translated, human-readable concept vectors.</p>
        <p>To evaluate the framework’s full potential under controlled conditions, we isolate the model’s behavior
from noise introduced by LLMs. While candidate HOCs are automatically generated via a large language
model, human involvement is restricted to annotating the presence or absence (0/1) of each predefined
concept. This preserves the automation of concept discovery while ensuring label quality.</p>
        <p>For comparison, we trained a traditional CBM using the same label predictor architecture as our
proposed method. To evaluate the effect of concept supervision, we trained CBMs using 5, 10, and 15
randomly selected HOCs. Each model was trained for 150 (<inline-formula><tex-math>= T_{\mathrm{iter}} \times T_{\mathrm{epoch}}</tex-math></inline-formula>) epochs using a combined
loss function: binary cross-entropy for HOC prediction and cross-entropy for label prediction. We
repeated each configuration across five random seeds and report the mean test accuracy.</p>
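        <p>A combined loss of this form can be sketched in plain Python. The text does not state how the two terms are weighted, so the concept_weight parameter (default 1.0) is an assumption, as are the function names and example values.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Mean binary cross-entropy over the concept predictions of one example."""
    eps = 1e-12  # guards against log(0)
    terms = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
             for p, t in zip(probs, targets)]
    return sum(terms) / len(terms)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cbm_loss(concept_probs: Sequence[float], concept_targets: Sequence[int],
             class_logits: Sequence[float], label: int,
             concept_weight: float = 1.0) -> float:
    """Label loss plus weighted concept loss (the weighting is an assumption)."""
    return (cross_entropy(class_logits, label)
            + concept_weight * binary_cross_entropy(concept_probs, concept_targets))


# Hypothetical predictions for three HOCs and four classes.
loss = cbm_loss([0.9, 0.2, 0.7], [1, 0, 1], [2.0, 0.5, -1.0, 0.1], label=0)
print(round(loss, 4))
```
]]></preformat>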
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Results and Discussion</title>
        <p>We report results from a single trial of the proposed iterative framework, which progressively aligned
latent concepts with HOCs over five training iterations. In each iteration, one statistically significant
alignment was identified via Fisher’s exact test (<inline-formula><tex-math>p &lt; 0.1</tex-math></inline-formula>); for example, <inline-formula><tex-math>z_{15}</tex-math></inline-formula> was aligned with <inline-formula><tex-math>\mathrm{HOC}_{21}</tex-math></inline-formula> (slender body) in
Iteration 1 and <inline-formula><tex-math>z_{16}</tex-math></inline-formula> with <inline-formula><tex-math>\mathrm{HOC}_{18}</tex-math></inline-formula> (long, tapered tail) in Iteration 5. Full alignments and metrics are
shown in Table 1. These results indicate that the framework refines latent structures over time, aligning
emergent concepts with HOCs while maintaining high predictive accuracy. Unlike traditional CBMs
that rely on predefined concepts, our method identifies meaningful associations post hoc, enabling
more flexible and scalable interpretability.</p>
        <p>Validation accuracy increased from 85% to 97.5%, and test accuracy from 75% to 100% over iterations.
Validation loss declined from 0.8490 in iteration 1 to 0.3675 in iteration 5, with minor fluctuations across
epochs. These metrics indicate improved performance over time, although gains were not strictly
monotonic. The falling loss alongside the growing number of alignments suggests a potential relationship
between concept alignment and model confidence.
As training progressed, more stable predictions were observed, likely enabled by PPO-based
sender–receiver optimization, which allowed latent representations to evolve without causing abrupt changes
in downstream accuracy. However, whether such stability generalizes to datasets with more complex
feature distributions remains an open question.</p>
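        <p>To further examine the relationship between latent units and human-interpretable concepts, we
visualize sample groupings in Figure 2. Subfigure 2a compares latent concept <inline-formula><tex-math>z_5</tex-math></inline-formula> with <inline-formula><tex-math>\mathrm{HOC}_{24}</tex-math></inline-formula>, and
Subfigure 2b compares <inline-formula><tex-math>z_{17}</tex-math></inline-formula> with <inline-formula><tex-math>\mathrm{HOC}_{16}</tex-math></inline-formula>. In each panel, the top two rows are grouped by HOC values,
and the bottom two by latent activations. For <inline-formula><tex-math>z_5</tex-math></inline-formula>, some visual consistency is observable with the paw
shape associated with <inline-formula><tex-math>\mathrm{HOC}_{24}</tex-math></inline-formula>. For <inline-formula><tex-math>z_{17}</tex-math></inline-formula>, activated samples tend to belong to a consistent breed (e.g.,
Russian Blue), though the whisker attribute associated with <inline-formula><tex-math>\mathrm{HOC}_{16}</tex-math></inline-formula> is not always present. We discuss
this type of partial alignment and its implications in Section 4. These examples reflect the post hoc
nature of alignment in our framework: some latent features align well with visual attributes, while
others only partially capture the associated HOC. This underscores the need for more robust validation
methods to assess the semantic coherence of emergent representations.</p>
        <p>Figure 2: (a) <inline-formula><tex-math>z_5</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{HOC}_{24}</tex-math></inline-formula> (slim oval paws). (b) <inline-formula><tex-math>z_{17}</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{HOC}_{16}</tex-math></inline-formula> (no visible whiskers).</p>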
        <p>The traditional CBM achieved a mean test accuracy of 71.5% when trained with 5 randomly selected
HOCs. Increasing the number of HOCs to 10 and 15 improved the mean test accuracy to 96.5% and
99.5%, respectively. These results indicate that strong performance in the traditional CBM setting
depends on supervision from a sufficiently large and informative concept set. In contrast, our
framework achieved 100% test accuracy by the final iteration while relying on only 5 HOC alignments
selected through iterative discovery. Performance trends are illustrated in Figure 3. The comparison highlights
the framework’s ability to dynamically refine concept representations based on data, without requiring full concept
annotations. The progressive nature of alignment, combined with strong classification performance,
suggests potential advantages in generalization and interpretability when compared to fixed, manually
labeled bottlenecks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Limitations and Future Directions</title>
      <p>While the iterative communication-emergence CBM framework shows promise, it has limitations in
concept coverage and consistency. In our experiments, only five latent concepts aligned confidently
with HOCs, leaving many unmapped due to weak associations or missing human-defined attributes.
Future work could cluster activations and analyze them via language models or human feedback to
expand or refine the HOC space. Some unmapped concepts may reflect novel patterns, but not all are
necessarily interpretable.</p>
      <p>
        We did not explicitly assess mapping consistency across training runs or datasets. Although aligned
concepts remained fixed once identified during our single trial, their stability under different data
distributions or longer training remains unverified. Initial assignments may vary due to dataset
properties or learning dynamics, potentially leading to concept drift. Addressing this may require more
flexible alignment strategies and evaluations across multiple runs [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>Limitations also arise in the scope of interpretability. Our evaluation relies on statistical correlations
with predefined HOCs, confirming whether a latent factor rediscovers a known concept, but not how it
is internally represented. The fixed binarization threshold (<inline-formula><tex-math>\tau = 0.5</tex-math></inline-formula>) may also affect which concepts align,
but its influence was not evaluated. Additionally, the sender is optimized for prediction accuracy, not concept-level
supervision, allowing spurious correlations to persist. Although aligned variables are filtered before
prediction, the sender may still generate them, with no mechanism to discourage reliance on
non-interpretable factors. Future work could examine the threshold’s effect, use visualization or attribution
methods to interpret latent factors, and incorporate causal or adversarial regularization to mitigate
spurious patterns. Expert review may help define new HOCs, and applying the framework to diverse
datasets will be essential to assess generalizability.</p>
      <p>The traditional CBM was not further fine-tuned.
While the comparison still highlights differences in reliance on supervision, future work should explore
whether tuning CBM hyperparameters improves performance. In our PPO reward, we use the maximum
predicted logit as a confidence measure. While this supports interpretability, exploring alternatives
such as maximum softmax probability or output entropy remains a promising direction.</p>
      <p>A key limitation is that the current alignment strategy permits only many-to-one mappings from
latent concepts to HOCs, without supporting many-to-many relationships. In practice, multiple latent
concepts may redundantly align with the same HOC, while a single latent concept may also encode
multiple abstract or overlapping attributes. Post hoc alignment does not influence training in our
current setup. Incorporating soft alignment rewards could encourage interpretability as associations
emerge, but must be balanced against model flexibility to avoid the rigidity of traditional CBMs.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work presents an iterative communication emergence framework for interpretable machine
learning, integrating reinforcement learning into a CBM to enable emergent concept discovery. Unlike
traditional CBMs that rely on predefined annotations, our approach allows the model to autonomously
align latent representations with human-understandable attributes through iterative refinement.</p>
      <p>Experiments on cat breed classification demonstrate that interpretability improves progressively
over training—latent concepts become increasingly aligned with observable features—while predictive
accuracy also improves. The model achieved perfect test performance by the final iteration, despite
relying on only five post hoc-aligned HOCs, without direct concept supervision. Aligned concepts
remained stable once discovered, suggesting robustness in alignment.</p>
      <p>While our framework enables latent concepts to align with human-interpretable attributes, it has
limitations. Many concepts remain unmapped, and the current many-to-one alignment may oversimplify
complex or overlapping patterns. Since alignment occurs post hoc, it does not guide training. Future
work could incorporate soft alignment objectives, adaptive mappings, and visualization techniques to
better integrate interpretability into the learning process. Ultimately, bridging emergent representations
with human semantics offers a path toward more transparent and trustworthy AI systems.</p>
      <p>The author(s) used GPT-4o and Grammarly for grammar and minor wording improvements and reviewed
and edited the content as needed. The author(s) take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was partly supported by JST CREST grant number JPMJCR22M2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <article-title>From understanding to utilization: A survey on explainability for large language models</article-title>
          , arXiv preprint arXiv:2401.12874 (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2401.12874.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Inala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Rethinking interpretability in the era of large language models</article-title>
          , arXiv preprint arXiv:2402.01761 (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2402.01761.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mussmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pierson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Concept bottleneck models</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé III</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>5338</fpage>
          -
          <lpage>5348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parbhoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <article-title>Addressing leakage in concept bottleneck models</article-title>
          ,
          in:
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>23386</fpage>
          -
          <lpage>23397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yuksekgonul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Post-hoc concept bottleneck models</article-title>
          ,
          in:
          <source>Proceedings of the International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schrodi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Argus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>Concept bottleneck models without predefined concepts</article-title>
          ,
          <source>arXiv preprint</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2407.03921. doi:10.48550/arXiv.2407.03921. arXiv:2407.03921.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panagopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskar</surname>
          </string-name>
          ,
          <article-title>Language in a bottle: Language model guided concept bottlenecks for interpretable image classification</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19187</fpage>
          -
          <lpage>19197</lpage>
          . doi:10.1109/CVPR52729.2023.01839.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <article-title>Emergent multi-agent communication in the deep learning era</article-title>
          ,
          <source>arXiv preprint</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2006.02419. arXiv:2006.02419.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>arXiv preprint</source>
          (
          <year>2017</year>
          ). URL: https://arxiv.org/abs/1707.06347. arXiv:1707.06347.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          ,
          <article-title>Cats and dogs</article-title>
          ,
          in:
          <source>Proceedings of IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>3498</fpage>
          -
          <lpage>3505</lpage>
          . doi:10.1109/CVPR.2012.6248092.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chaabouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharitonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bouchacourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dupoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <article-title>Compositionality and generalization in emergent languages</article-title>
          , in:
          <source>Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>4427</fpage>
          -
          <lpage>4442</lpage>
          . URL: https://aclanthology.org/2020.acl-main.407. doi:10.18653/v1/2020.acl-main.407.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tallec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Grill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pietquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dupoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Strub</surname>
          </string-name>
          ,
          <article-title>Emergent communication: Generalization and overfitting in Lewis games</article-title>
          ,
          in:
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          of
          <source>NIPS '22</source>
          , Curran Associates Inc.,
          <year>2022</year>
          , pp.
          <fpage>16744</fpage>
          -
          <lpage>16760</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>