=Paper=
{{Paper
|id=Vol-3910/aics2024_p69
|storemode=property
|title=Towards Understanding Deep Representations in CNN: from Concepts to Relations Extraction via Knowledge Graphs
|pdfUrl=https://ceur-ws.org/Vol-3910/aics2024_p69.pdf
|volume=Vol-3910
|authors=Eric Ferreira dos Santos,Alessandra Mileo
|dblpUrl=https://dblp.org/rec/conf/aics/SantosM24
}}
==Towards Understanding Deep Representations in CNN: from Concepts to Relations Extraction via Knowledge Graphs==
Towards Understanding Deep Representations in CNN: from Concepts to Relations Extraction via Knowledge Graphs
Eric Ferreira dos Santos1,∗, Alessandra Mileo1
1 Dublin City University, Collins Ave Ext - Whitehall, Dublin, Ireland
Abstract
The ability to understand deep representations from trained Convolutional Neural Networks (CNNs) in image classification tasks is still limited when it comes to effectively justifying the reasons behind a given outcome. This is due to the fact that most approaches focus on low-level features (such as pixels used to generate saliency maps), while human understanding is based on concepts and relations among those concepts. To address this problem, we propose an approach that aims to extract high-level human concepts and their relations from deep learning models by combining disentangled representations and commonsense knowledge graphs, and by relying on textual descriptions of visual relations as ground truth for evaluation. The concept extraction phase leverages Network Dissection as a disentangled representation approach to collect global and local concepts learned by a trained CNN combined with a linear classifier. The relation extraction step uses CSKG, a commonsense knowledge graph, to find relations between those concepts. The visual relation dataset Visual Genome is used as ground truth to validate the learned relations. Based on the relation coverage between local and global features, our approach paves the way to understanding what a CNN learned in a way that can be easily interpreted by humans, presenting the importance of specific concepts and relations for a given classification task.
Keywords
Explainable AI, Computer Vision, Knowledge Graph, Convolutional Neural Network
1. Introduction and Motivation
Deep learning models, specifically Convolutional Neural Networks (CNNs) [1], have revolutionised
computer vision applications such as image classification, object detection, and segmentation. While
these models have attained cutting-edge performance on a variety of tasks, their intrinsic complexity
and limited interpretability beyond visual cues pose substantial barriers to their use in real-world
applications where trustworthiness is key.
The field of Explainable AI (XAI) has gained significant traction as researchers aim to develop reliable methods to unveil the decision process of deep neural networks. In image classification, XAI techniques help identify which image features are critical for a model’s outcome. Visual approaches, such as those described in [2, 3], focus on relating outputs with image features, but they require considerable human interpretation and can produce inconsistent results across different samples in the dataset. As a result,
alternative strategies that go beyond visual explanations, including textual justifications and feature
relevance, have emerged. These include approaches for combining textual and visual data, simplifying
complex models, and utilising human-understandable concepts [4].
One of the key challenges in supporting human understanding of deep neural networks is the ability
to effectively disentangle low level features from high level concepts. This has motivated approaches
such as the one in [5], where disentangled concepts are ranked and considered as high-level features
at both local and global level. In order to improve understanding of deep representation in image
classification tasks beyond mere concept ranking, we want to enable discovery of relations among
disentangled concepts. To this aim, this research paper proposes an approach that leverages external
knowledge graphs as well as textual descriptions for the discovery, ranking and validation of semantic relations among top-ranked concepts.
AICS’24: 32nd Irish Conference on Artificial Intelligence and Cognitive Science, December 09–10, 2024, Dublin, Ireland
∗ Corresponding author.
eric.ferreiradossantos2@mail.dcu.ie (E. F. d. Santos); alessandra.mileo@dcu.ie (A. Mileo)
ORCID: 0000-0002-0408-5756 (E. F. d. Santos); 0000-0002-6614-6462 (A. Mileo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Our main contribution is a method that takes as input disentangled concepts learned by a CNN trained for an image classification task, and leverages a commonsense knowledge graph to extract meaningful relations among concept pairs. Our validation approach relies on the Visual Genome dataset1, where concepts in images are related to their textual descriptions.
The remainder of the paper is organised as follows: Section 2 presents related work on concept extraction from CNNs and the use of knowledge graphs for modelling relations. Section 3 describes our approach; Section 4 discusses our experimental evaluation; and Section 5 concludes by discussing challenges and promising directions.
2. Related Work
When it comes to image classification as a computer vision task, most published works so far focus on the use of CNN architectures to achieve the best performance. A more recently emerging approach, known as Vision Transformers (ViT) [6], relies on attention mechanisms and their success in natural language processing [7]. Although ViT outperforms CNN models in some cases, the difference is not significant considering the additional time and data needed for training and the risk of increased bias [8]. For these reasons, in this paper, we focus on using CNNs for our investigation.
When it comes to understanding the inner workings of the model, among the approaches for explainability we focus on feature relevance [9]: specifically, we are interested in how to effectively translate relevant features into semantic concepts which humans can understand, and how to leverage knowledge graphs to elicit relations among such concepts.
A concept refers to a high-level, human-understandable abstraction, such as the colour “black” or
the object “bicycle”, which are defined by humans and do not require any additional information for a
human to understand [10]. We can define a relation as a connection or association between two or more
concepts. This connection can be based on factors such as causality, similarity, proximity, or functional
dependence. For example, the concepts “bicycle” and “wheel” can be connected by a relation such as “a
wheel is part of a bicycle”, as well as “a bicycle contains a wheel”.
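As an illustration (not part of the original pipeline), such relations are commonly represented as (subject, predicate, object) triples; a minimal sketch, with purely illustrative names:

```python
from typing import NamedTuple

class Relation(NamedTuple):
    """A directed relation between two human-understandable concepts."""
    subj: str
    pred: str
    obj: str

# Two ways of relating the same pair of concepts.
relations = [
    Relation("wheel", "PartOf", "bicycle"),
    Relation("bicycle", "HasA", "wheel"),
]

for r in relations:
    print(f"{r.subj} --{r.pred}--> {r.obj}")
```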
Several approaches for DNN-based relation extraction have been proposed in Natural Language Processing [11]. However, the use of knowledge graphs is still under-investigated as a promising direction to get direct access to relational knowledge for explanation generation and reasoning, as well as to obtain robust outcomes that are less dependent on the input data [12]. To the best of our knowledge, none of these approaches have been explored for image data specifically. The use of knowledge graphs to bridge the gap between visual concepts and semantic relations does not assume the presence of textual labels or captions (although we need them for evaluation); this characterises the new angle of our approach, but also determines the lack of a comparative baseline.
For this reason, as part of related work, we focus on approaches that tackle each step individually,
namely: methods for concept extraction from deep models and the use of knowledge graphs to discover
relations between semantic concepts (mostly in text).
2.1. Concept Extraction
When it comes to explainability, it is necessary to go beyond feature relevance [13] towards approaches that relate low-level features to high-level human concepts [14], assuming that each filter/neuron can independently be responsible for learning one or more concepts.
Network Dissection [15] is one such approach for disentangled representation, which assigns relevant labels to each filter of a CNN. To achieve that, the authors rely on the Broden dataset, composed of pixel-annotated low-level concepts such as colours and high-level concepts such as objects. A trained CNN is run on the Broden dataset to compare the binary concept map of each image with each filter activation map: if the convolutional filter is substantially activated in areas of the image containing a given human-labelled concept, the filter is considered to be “looking for” that concept for a particular test image.
1 https://homes.cs.washington.edu/~ranjay/visualgenome/index.html
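As a rough sketch of this matching step (our own simplification, not the Network Dissection implementation; the per-image activation quantile is an assumption, since the original derives its threshold over the whole dataset):

```python
import numpy as np

def filter_concept_iou(activation_map: np.ndarray,
                       concept_mask: np.ndarray,
                       quantile: float = 0.995) -> float:
    """IoU between a thresholded activation map and a binary concept mask.

    activation_map: 2D array of filter activations, resized to image resolution.
    concept_mask:   2D boolean array marking pixels annotated with the concept.
    quantile:       activation quantile used to binarise the map (simplifying
                    assumption; Network Dissection computes its threshold
                    over the whole dataset).
    """
    threshold = np.quantile(activation_map, quantile)
    active = activation_map >= threshold
    intersection = np.logical_and(active, concept_mask).sum()
    union = np.logical_or(active, concept_mask).sum()
    return float(intersection) / union if union > 0 else 0.0

# A filter is considered to be "looking for" a concept when the IoU exceeds
# a small cutoff (the original Network Dissection work uses 0.04).
```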
A different approach presented in [16] proposed a method to learn a decision tree which approximates
the hierarchy of concepts learned by a trained CNN. Unlike Network Dissection [15], this approach does
not rely on object-part annotations and is therefore only an approximation of the CNN decision process.
In addition, it can be subject to noise when filters are activated by unrelated visual concepts.
The approach in [17] suggests investigating how specific human concepts impact classification outcomes based on Concept Activation Vectors (CAV). The general idea of CAV is to train a linear (binary) classifier to separate each specific concept and use a directional derivative to assess concept sensitivity for a particular class. This method identifies the relevance of a particular concept for the prediction of a class, but it requires the user to already know which concepts are descriptive of that class in order to verify which ones affect the classification the most.
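A minimal sketch of the CAV idea under our own assumptions (a scikit-learn logistic regression stands in for the linear probe; the original TCAV implementation differs in its details):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts: np.ndarray,
                              random_acts: np.ndarray) -> np.ndarray:
    """Train a linear probe on layer activations and return the concept direction.

    concept_acts: (n_concept, d) activations for images showing the concept.
    random_acts:  (n_random, d) activations for random images.
    """
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_.ravel()
    return cav / np.linalg.norm(cav)

def concept_sensitivity(logit_grad: np.ndarray, cav: np.ndarray) -> float:
    """Directional derivative of the class logit along the concept direction:
    a positive value means the concept pushes the prediction towards the class."""
    return float(np.dot(logit_grad, cav))
```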
Although these approaches rely on exposing a set of (pre-defined or dataset-dependent) human concepts that can help identify what the model learned for a specific sample, they do not provide a global characterisation of how the model generalises to an entire class. An improvement in this direction is presented in [5], where the authors combine the Network Dissection approach for disentangled representations with a linear classifier to understand which concepts the model learned both locally and globally. Local concepts are those related to each input image, while global concepts are generalised over an entire class. Specifically, the authors in [5] use Network Dissection to identify the top-k local concepts for each image of a given class, and train a linear SVM classifier to collect the features that were most important for the classification.
We believe human explanations are characterised by relations between concepts, as opposed to
concepts in isolation. To this aim, in this paper we focus on going from concepts to relations, and we
suggest doing that by leveraging approaches to relation extraction from knowledge graphs, applied
to a ranked list of concepts such as those generated in [5]. Relevant work on relation extraction and
knowledge graphs is presented in the next subsection.
2.2. Relation Extraction via Knowledge Graphs
A knowledge graph (KG) is a structured representation of knowledge containing entities, attributes,
and relations. KGs capture the meaning of unstructured data and provide a semantic framework for
intelligent systems [18]. KGs have been used in various applications, including question answering,
information extraction, and entity recognition [19]. They have also been used to combine data from
numerous sources, making complex data more accessible to reason about automatically. Machine
learning and artificial intelligence advancements have fuelled the creation of KGs from large-scale data
sources by using algorithms for knowledge extraction [20].
Commonsense KGs are specifically useful for reasoning about entities and their relations, which
people consider intuitive, and this knowledge is critical for artificial intelligence applications because it
has the potential to help machines understand and reason about the world like humans do.
The CommonSense Knowledge Graph (CSKG) [21] facilitates the use of such knowledge; it is a resource that combines seven diverse and disjoint sources: a text-based commonsense knowledge graph, ConceptNet2; a general-purpose taxonomy, Wikidata3; an image description dataset, Visual Genome [22]; a procedural knowledge source, ATOMIC4; and three lexical sources: WordNet5, Roget6, and FrameNet7. Through the combination of different sources, the CSKG offers a variety of nodes (objects or concepts) and edges (relations) in order to provide a commonsense knowledge base for reasoning.
2 https://conceptnet.io/
3 https://www.wikidata.org/wiki/Wikidata:Main_Page
4 https://huggingface.co/datasets/allenai/atomic
5 https://wordnet.princeton.edu/
6 https://www.gutenberg.org/ebooks/22
7 https://framenet.icsi.berkeley.edu/
KGs have been employed to develop automatic methods for understanding the relations between objects in a picture. For example, ConceptNet was used after image classification in [23] to identify sample-specific relations. However, the method to align the concepts is mostly manual [24], and it does not explain how the model learned those concepts and relations. The authors in [25] also suggest an ontology-based approach to identify objects and their relations using a KG. However, the alignment of the KG with the concepts, and how the deep representation learned their relations, is still under-investigated.
Concept-based explanation approaches discussed earlier in this section provide numerous ways for
interpreting what the CNN model learned, but they do not provide any tool or approach for automatic
relation extraction, which we aim to achieve using Knowledge Graphs.
This paper proposes a solution combining these two steps to go from concepts learned by a trained
CNN to relations via knowledge graphs. Building upon approaches for concept extraction such as those
in [15, 5] we aim to extract and validate relations among those concepts using CSKG as Commonsense
Knowledge Graph. Our approach is detailed in the next section.
3. Overall Methodology for Concepts and Relation Extraction and
Validation
The proposed approach has three phases, as illustrated in Figure 1: concept extraction, relation extraction and relation validation. We rely on Network Dissection as in [15, 5] for concept extraction, while CSKG8 is used as the knowledge graph to extract relations between concepts, and the Visual Genome dataset is used to validate learned relations. It is important to emphasise that each phase of the approach can use a different knowledge source or dataset to extract the concepts and relations, as well as to validate those relations. In the remainder of this section we describe the datasets and knowledge graph used in our experiments, and we detail each of these phases separately.
Figure 1: Conceptual diagram of the approach
8 https://cskg.readthedocs.io/en/latest/
3.1. Datasets and Knowledge Graph
The Broden dataset [15] contains 63,305 images9 with detailed annotations for objects, parts, materials, and scenes, making it ideal for associating filters with high-level concepts in the concept extraction step. In our experiments, we replaced the last fully connected layer of a pre-trained model to adapt it to action recognition based on the Action40 dataset [26], which includes 9,532 images across 40 action classes; we also use the well-known CIFAR-10 [27] dataset, containing 60,000 small images across ten classes, as a benchmark for the performance of our approach.
For relation extraction, we rely on the CSKG (Commonsense Knowledge Graph) [21], which provides a framework for investigating commonsense reasoning by exploring the relations between concepts, with over 2 million concepts and 58 relations. The relations in this resource are taxonomical and lexically based, combined from different sources. In addition, the Visual Genome [22] dataset, containing 101,174 images with around 42,000 distinct relations, is used for interpreting visual scenes; its extensive annotations of objects, relations, and scene attributes aid in validating the relations learned by the model.
3.2. Concept Extraction
In the concept extraction phase, we apply Network Dissection to a ResNet-152 model [28] trained on ImageNet10, similarly to [5], and we use transfer learning to adapt the model weights to the Action40 [26] and CIFAR-10 [27] datasets. We identify the most relevant concepts for each input image by taking the K highest-scoring filters in the last CNN layer, based on the mean of each activation map.
A linear SVM classifier is then used to identify the top K features per class (referred to as global concepts); we then evaluate which semantic concepts better separate classes based on feature significance. The authors in [5] experimentally identified K = 10 as the value obtaining the best (95%) precision for both local and global concepts, and we therefore used the same value for K.
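A minimal sketch of how the two rankings could be computed (our own code, not the authors'; the tensor shapes, the filter-to-concept mapping produced by Network Dissection, and the use of scikit-learn's LinearSVC are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

K = 10  # value adopted from [5]

def top_k_local_concepts(activations: np.ndarray,
                         filter_to_concept: dict[int, str],
                         k: int = K) -> list[str]:
    """Rank filters of the last convolutional layer by mean activation.

    activations:       (n_filters, h, w) activation maps for one input image.
    filter_to_concept: Network Dissection label assigned to each filter index.
    """
    scores = activations.mean(axis=(1, 2))        # one score per filter
    top_filters = np.argsort(scores)[::-1][:k]    # K highest-scoring filters
    return [filter_to_concept[i] for i in top_filters if i in filter_to_concept]

def top_k_global_concepts(features: np.ndarray, labels: np.ndarray,
                          class_idx: int,
                          filter_to_concept: dict[int, str],
                          k: int = K) -> list[str]:
    """Train a linear SVM on pooled per-filter features (one column per filter)
    and rank features for a class by coefficient magnitude (our reading of
    the global step)."""
    svm = LinearSVC(max_iter=10000).fit(features, labels)
    weights = np.abs(svm.coef_[class_idx])
    top_features = np.argsort(weights)[::-1][:k]
    return [filter_to_concept[i] for i in top_features if i in filter_to_concept]
```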
3.3. Extracting Candidate Relations
Given the top K local and global concepts identified in the first phase, we query the CSKG knowledge graph for candidate relations among all combinations of concept pairs in the top K. This step relies on query pattern matching via KGTK, a Python library for KG manipulation11. Note that in CSKG nodes represent concepts and edges represent relations12. In this preliminary investigation, and in order to reduce the combinatorial complexity of this step, we only focused on direct relations between concepts: the algorithm takes a concept pair as input and returns as output only the direct edges between the nodes in the concept pair. Note that at this point these edges are only candidate relations based on commonsense knowledge, and we therefore still need to verify whether the deep representation has learned such relations. We do that as discussed in the next subsection.
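The query itself is not reproduced in the paper; below is a sketch of a direct-edge lookup over the CSKG edge table using pandas rather than KGTK's query interface (the column names "node1;label", "relation" and "node2;label" are assumed from the published CSKG TSV format and should be checked against the release used):

```python
import itertools
import pandas as pd

def direct_relations(edges: pd.DataFrame,
                     concept_a: str, concept_b: str) -> pd.DataFrame:
    """Return direct CSKG edges linking two concepts, in either direction.

    edges: the CSKG edge table, e.g. pd.read_csv("cskg.tsv.gz", sep="\t"),
           with 'node1;label', 'relation' and 'node2;label' columns
           (assumed column names).
    """
    a, b = concept_a.lower(), concept_b.lower()
    lab1 = edges["node1;label"].str.lower()
    lab2 = edges["node2;label"].str.lower()
    forward = (lab1 == a) & (lab2 == b)
    backward = (lab1 == b) & (lab2 == a)
    return edges.loc[forward | backward,
                     ["node1;label", "relation", "node2;label"]]

# Candidate relations for every pair among the top-K concepts:
# for c1, c2 in itertools.combinations(top_k_concepts, 2):
#     candidates = direct_relations(edges, c1, c2)
```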
3.4. Visual Validation of Candidate Relations
The Visual Genome dataset [22] contains pictures with tagged relations identified by bounding boxes.
In order to identify which relation our model is likely to have learned, we proceed as follows:
1. for every candidate relation Ri among concepts C1 and C2 extracted from CSKG, we identify all images Ij in Visual Genome representing Ri 13;
2. every image Ij for a given relation Ri is passed through the model used for concept detection, with the hypothesis that if the two relevant concepts C1 and C2 for Ri are among the top ten activated filters, the network has likely learned the corresponding relation;
3. due to differences in relation names between CSKG and Visual Genome, we relaxed the exact string matching approach by comparing the results obtained with the use of Named-Entity Recognition (NER) approaches from Natural Language Processing (namely from spacy.io14 and NLTK15) and the well-known word2vec16.
9 http://netdissect.csail.mit.edu/data/broden1_227.zip
10 https://www.image-net.org/
11 https://kgtk.readthedocs.io/en/latest/transform/query/
12 https://github.com/commonsense/conceptnet5/wiki/Relations
13 Here we use string matching between CSKG and Visual Genome concepts.
The reason we selected Visual Genome as a more robust approach to validation, as opposed to Large Language Models such as GPT-417 or Gemini18, is the reduced risk of hallucinations [29], as well as the ability to access a visual representation of relations that we can use as ground truth to relate concepts to disentangled filters.
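A sketch of the resulting validation loop (our own code; relation_learned, top_concepts_fn and the image iterator are hypothetical names, and min_hits encodes the "validated more than once" threshold discussed in Section 4):

```python
def relation_learned(model, candidate, vg_images, top_concepts_fn,
                     k=10, min_hits=2):
    """Check whether a candidate relation (c1, rel, c2) appears to be learned.

    candidate:       a (concept_1, relation, concept_2) triple from CSKG.
    vg_images:       Visual Genome images annotated with that relation.
    top_concepts_fn: returns the concepts of the k most activated filters
                     for an image (e.g. top_k_local_concepts above).
    min_hits:        require the relation to be validated more than once.
    """
    c1, _, c2 = candidate
    hits = 0
    for image in vg_images:
        top = top_concepts_fn(model, image, k)
        if c1 in top and c2 in top:
            hits += 1
    return hits >= min_hits
```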
4. Experimental Evaluation
Our experiments were conducted on a machine running Linux Mint 21.2, with 48 CPUs and 128 GB of RAM. We rely on two main Python libraries: PyTorch19 for training and testing the models and KGTK for working with the knowledge graph. The code used in this research is available on the GitHub repository20.
The concept extraction phase on ResNet-152 pre-trained on the ImageNet dataset resulted in 162 different concepts including objects, parts, materials, and colours. We started from this pre-trained model and applied transfer learning, freezing all the trained layers and replacing the fully connected layers with the linear classifier. This enabled us to apply concept extraction, relation extraction and validation (with string matching and with Named-Entity Recognition, or NER) on two different datasets, namely Action40 and CIFAR-10.
The results of our analysis are presented in Table 1, where we report the number of unique global (# R (Global)) and local (# R (Local)) relations extracted using CSKG, the percentage of Visual Genome images containing the K local and global concepts extracted from each dataset (% VG Images), and the total number of unique relations learned using Visual Genome (# R (VG)). Note that the percentage of Visual Genome images and the relations learned are not distinguished into local and global, as they relate to the entire model when considering the overlap between local and global.
Table 1
Candidate local and global relations from CSKG, percentage of the Visual Genome images that contain the same pair of concepts as the local and global relations from CSKG, and relations found using the Visual Genome only.

Dataset  | # R (Global) CSKG | # R (Local) CSKG | % VG Images | # R (VG)
Action40 | 339               | 2,495            | 12%         | 3,434
CIFAR-10 | 137               | 2,176            | 14%         | 2,566
Based on the relations learned (Table 1, last column), we can now look for relations that are present in
both CSKG and VG, both local and global. Table 2 shows these common relations found by simple string
matching (#R) and by relaxing the matching using NER approaches (#R_NER). We consider the sum of
unique relations extracted by any of the three NER approaches. We did not consider the contribution
of each individual approach in this analysis. Table 2 also presents the percentage of relations validated from Visual Genome more than once that were also present in CSKG (% R_VAL), separated into global and local relations respectively. We use this threshold (relation validated more than once) to increase the likelihood that the relation was actually learned. This means, for example, that starting from the candidate global relations (Table 1, # R (Global)), extracted from the combination of the top K features and the CSKG in the candidate relation extraction phase, only 8% were validated as learned by the model for the Action40 dataset.
14 https://spacy.io/models/en#en_core_web_lg
15 https://www.nltk.org/index.html
16 https://radimrehurek.com/gensim/models/word2vec.html
17 https://openai.com/index/gpt-4/
18 https://gemini.google.com/app
19 https://pytorch.org/get-started/locally/
20 https://github.com/EricFerreiraS/relation_extraction_AICS24
Table 2
The number of relations found, locally and globally (R), their relaxation with NER (R_NER), and the percentage of global and local relation candidates validated.

Dataset  | # R (Global) | # R (Local) | # R_NER (Global) | # R_NER (Local) | % R_VAL (Global) | % R_VAL (Local)
Action40 | 7            | 51          | 23               | 166             | 8%               | 9%
CIFAR-10 | 1            | 15          | 23               | 118             | 2%               | 3%
We observe that there is a high number of local relations, but when validated across the instances of a class, only a few of those relations are likely to influence the classification task. Our method allows us to identify such global relations, reducing noise.
In order to capture a more fine-grained view of this phenomenon per class, we define the notion of coverage, which measures how many local relations (over all images of a given class) are also present as global relations for that class. These global relations are relations among concepts specific to that class, as they best separate that class’s feature space. If at least one relation is shared, the globally ranked relations are consistent with the local ones. The coverage formula is as follows:
\[
\mathrm{Coverage}_c = \frac{\sum C_{l_c, g_c}}{\# L_c} \tag{1}
\]
where Coverage_c is the coverage for a specific class c, \(\sum C_{l_c, g_c}\) is the number of instances for which the local (l_c) and global (g_c) relations have at least one element in common for class c, and #L_c is the number of local instances that belong to class c. Figures 2 and 3 present the values of Coverage_c for each of the two datasets.
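As a direct transcription of Equation (1), a small helper (our own code) that computes coverage from per-instance local relation sets and the class's global relation set:

```python
def coverage(local_relations_per_instance, global_relations):
    """Coverage of a class: fraction of its instances whose local relations
    share at least one element with the class's global relations (Eq. 1).

    local_relations_per_instance: list of sets, one per image of the class.
    global_relations:             set of global relations for the class.
    """
    if not local_relations_per_instance:
        return 0.0
    shared = sum(1 for local in local_relations_per_instance
                 if local & global_relations)
    return shared / len(local_relations_per_instance)

# Example: 3 of 4 instances share a global relation -> coverage = 0.75
```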
This analysis helps clearly identify classes with very low coverage (such as “reading”) where the
identified local relations are not likely to be influencing the classification task, versus classes with high
coverage (such as “riding_a_bike”) where significant global relations are present in most of the local
instances. Figure 4 presents an example of how our method works for the class “riding_a_bike” in the
Action40 dataset.
It shows that, based on the top concepts extracted in the first phase, ten local relations for an instance of “riding a bike” and three global relations for the same label were selected as candidates. In the validation step, images from Visual Genome containing the concepts present in the candidate relations are used to verify which relations were learned by the model. In this case, only “wheel is part of a bicycle” was identified as having been learned. As the learned relation is present both locally and globally, we say that the relation covers this instance.
We might be tempted to say that for classes with low coverage, the model is likely to not have
learned relations that are crucial for the classification tasks, and this might be used as a starting point to
investigate how to better train the model for those classes, for example by looking at class imbalances
or data augmentation techniques as well as knowledge injection mechanisms.
However, we are aware that the values for coverage might also depend on other factors. For example, they might depend on how well the linear classifier identifies concepts that separate the classes. They might also be affected by the quality of the concept extraction approach based on Network Dissection, which in turn might depend on the quality of the dataset. For a low-resolution dataset such as CIFAR-10, for example, we observe that only the classes “dog” and “bird” have high coverage. This observation would need to be confirmed by conducting an ablation study to identify the quality of the concepts and how they affect the overall pipeline, and this is an area for further investigation.
Furthermore, a deeper investigation would be required to identify the impact of different concept extraction techniques (both global and local), before reaching a conclusion on how well the extracted relations are implicitly learned by the model.
Figure 2: Coverage on Action40
5. Conclusion and Next Steps
Efforts to improve the transparency of CNNs have so far largely focused on visual cues highlighting which pixels in an image most influenced the prediction. While useful, these methods fall short of providing a truly understandable and human-friendly explanation. In response to this limitation, we developed an approach to extract and validate relations among disentangled concepts from CNN models trained on image classification tasks. Our approach, combining concept extraction techniques and knowledge graphs, paves the way towards a deeper understanding of trained CNNs in terms of concepts and the semantic relations among them, and therefore has the potential to support human understanding of how these black-box models reach their decisions, and potentially to improve the learning process.
We are aware that more investigation is required to reach this ambitious goal, and we have identified some limitations of our approach that we plan to address in future work.
Figure 3: Coverage on CIFAR-10
One limitation of our method is its reliance on Network Dissection, which may not fully capture the complexities of human-understandable concepts. This could be addressed by relying on larger datasets with pixel-level concept
labels, or by exploring alternative methods for disentangling concepts, such as Concept Activation
Vectors [17] or CLIP-Dissect [30]. Additionally, there is a need to measure the consistency between commonsense knowledge graph relations and the concepts derived from Network Dissection and Visual Genome, and to define additional metrics that can help identify the suitability of different knowledge graphs. A broader comparison of the use of different image datasets and external knowledge graphs would be beneficial to analyse how the method proposed in this work can adapt to different domains and how the selection of validation data and knowledge graphs affects the results.
Figure 4: Relation Learned Example
Acknowledgments
Thanks to the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223 and the Insight SFI Research Centre for Data Analytics at Dublin City University under Grant No. SFI/12/RC/2289_P2.
References
[1] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, MIT Press,
Cambridge, MA, USA, 1998, p. 255–258.
[2] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, Cham, 2014, pp.
818–833. doi:10.1007/978-3-319-10590-1_53.
[3] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explana-
tions from deep networks via gradient-based localization, 2017. doi:10.1109/ICCV.2017.74.
[4] G. Ras, N. Xie, M. van Gerven, D. Doran, Explainable deep learning: A field guide for the uninitiated
73 (2022). URL: https://doi.org/10.1613/jair.1.13200.
[5] E. Ferreira dos Santos, A. Mileo, From disentangled representation to concept ranking: In-
terpreting deep representations in image classification tasks, Springer, 2023. doi:10.1007/
978-3-031-23618-1_22.
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words:
Transformers for image recognition at scale, ArXiv abs/2010.11929 (2020). URL: https://api.
semanticscholar.org/CorpusID:225039882.
[7] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
Attention is all you need, 2017. URL: https://api.semanticscholar.org/CorpusID:13756489.
[8] A. Mandal, S. Leavy, S. Little, Biased attention: Do vision transformers amplify gender bias more
than convolutional neural networks?, 2023. URL: https://papers.bmvc2023.org/0629.pdf.
[9] D. Minh, H. X. Wang, Y. F. Li, T. N. Nguyen, Explainable artificial intelligence: a comprehensive
review, Artificial Intelligence Review (2021). doi:10.1007/s10462-021-10088-y.
[10] E. Poeta, G. Ciravegna, E. Pastor, T. Cerquitelli, E. Baralis, Concept-based explainable artificial
intelligence: A survey, 2023. URL: https://arxiv.org/abs/2312.12936.
[11] X. Zhao, Y. Deng, M. Yang, L. Wang, R. Zhang, H. Cheng, W. Lam, Y. Shen, R. Xu, A comprehensive
survey on relation extraction: Recent advances and new frontiers (2024). doi:10.1145/3674501.
[12] H. Wang, K. Qin, R. Y. Zakari, G. Lu, J. Yin, Deep neural network-based relation extraction: an
overview, Neural Computing and Applications (2022). doi:10.1007/s00521-021-06667-3.
[13] P. P. Angelov, E. A. Soares, R. Jiang, N. I. Arnold, P. M. Atkinson, Explainable artificial intelligence:
an analytical review (2021). doi:10.1002/widm.1424.
[14] G. Mutahar, T. Miller, Concept-based explanations using non-negative concept activa-
tion vectors and decision tree for cnn models, 2022. URL: https://arxiv.org/abs/2211.10807.
arXiv:2211.10807.
[15] B. Zhou, D. Bau, A. Oliva, A. Torralba, Interpreting deep visual representations via network
dissection 41 (2019) 2131–2145. doi:10.1109/TPAMI.2018.2858759.
[16] Q. Zhang, Y. Yang, H. Ma, Y. Wu, Interpreting cnns via decision trees, Los Alamitos, CA, USA,
2019, pp. 6254–6263. doi:10.1109/CVPR.2019.00642.
[17] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, R. Sayres, Interpretability beyond
feature attribution: Quantitative testing with concept activation vectors (tcav), 2018. URL: https:
//arxiv.org/abs/1711.11279. arXiv:1711.11279.
[18] X. Zou, A survey on application of knowledge graph, Journal of Physics: Conference Series (2020).
doi:10.1088/1742-6596/1487/1/012016.
[19] S. Choudhary, T. Luthra, A. Mittal, R. Singh, A survey of knowledge graph embedding and their
applications, 2021. URL: https://arxiv.org/abs/2107.07842. arXiv:2107.07842.
[20] L. Zhong, J. Wu, Q. Li, H. Peng, X. Wu, A comprehensive survey on automatic knowledge graph
construction, ACM Comput. Surv. 56 (2023). URL: https://doi.org/10.1145/3618295.
[21] F. Ilievski, P. Szekely, B. Zhang, Cskg: The commonsense knowledge graph, 2021. URL: https:
//arxiv.org/abs/2012.11490. arXiv:2012.11490.
[22] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A.
Shamma, M. S. Bernstein, L. Fei-Fei, Visual genome: Connecting language and vision using
crowdsourced dense image annotations 123 (2017). doi:10.1007/s11263-016-0981-7.
[23] R. T. Icarte, J. A. Baier, C. Ruz, A. Soto, How a general-purpose commonsense ontology can
improve performance of learning-based image retrieval, 2017, pp. 1283–1289. doi:10.24963/
ijcai.2017/178.
[24] I. Tiddi, S. Schlobach, Knowledge graphs as tools for explainable machine learning: A survey 302
(2022). doi:10.1016/j.artint.2021.103627.
[25] N. Maillot, M. Thonnat, Ontology based complex object recognition, Image Vis. Comput. 26 (2008)
102–113. URL: https://api.semanticscholar.org/CorpusID:11507373.
[26] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, L. Fei-Fei, Human action recognition by learning
bases of action attributes and parts, 2011, pp. 1331–1338. doi:10.1109/ICCV.2011.6126386.
[27] A. Krizhevsky, Learning multiple layers of features from tiny images, 2009. URL: https://api.
semanticscholar.org/CorpusID:18268744.
[28] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition (2015) 770–778. URL:
https://api.semanticscholar.org/CorpusID:206594692.
[29] M. U. Hadi, Q. A. Tashi, R. Qureshi, A. Shah, A. Muneer, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar,
S. Z. Hassan, M. Shoman, J. Wu, S. Mirjalili, M. Shah, Large language models: A comprehensive
survey of its applications, challenges, limitations, and future prospects (2024). doi:10.36227/
techrxiv.23589741.v7.
[30] T. P. Oikarinen, T.-W. Weng, Clip-dissect: Automatic description of neuron representations in
deep vision networks (2022). URL: https://api.semanticscholar.org/CorpusID:248376976.