Enhancing Semantic Understanding in Vision Language Models Using Meaning Representation Negative Generation

Ziyi Shou, Fangzhen Lin
HKUST-Xiaoi Joint Laboratory
Department of Computer Science and Engineering
Hong Kong University of Science and Technology

Abstract
Vision language models have been criticized for performing like bag-of-words models, lacking semantic understanding. Efforts to address this concern have included the integration of composition-aware negative samples into contrastive learning methodologies. However, current negative generation methods show restricted semantic comprehension, diversity, and fluency. To tackle this issue, we propose leveraging Abstract Meaning Representation (AMR), a representation of considerable interest in natural language processing research, for negative sample generation. By altering the structure of the meaning representation, we create negative samples that have entirely different meanings but share close plain paraphrases. These AMR generated negatives are then incorporated alongside token swap negatives during contrastive training. Our results indicate that AMR generated negatives introduce significantly more diverse patterns. Furthermore, the inclusion of AMR generated negative samples enhances the models' performance across a range of compositional understanding tasks.

Keywords
Vision Language Models, Semantic Understanding, Compositional Understanding, Abstract Meaning Representation

KiL'24: Workshop on Knowledge-infused Learning co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
zshou@cse.ust.hk (Z. Shou); flin@cse.ust.hk (F. Lin)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

In recent years, the conspicuous development of vision language models (VLMs) across various tasks is evident [1, 2, 3]. However, VLMs have been criticized for performing akin to bag-of-words models, lacking semantic understanding, especially compositional understanding [4, 3, 5]. For instance, when some tokens in the caption of an image-caption pair are rearranged to produce an unaligned caption, a VLM may fail to notice the change. Consider the two image-caption pairs in Figure 1. In the left pair, the phrases "Three Jack-O-Lanterns" and "flowers" in the caption are swapped, resulting in a semantically very different sentence. But CLIP fails to notice the difference and even gives the modified caption a slightly higher similarity score. A similar effect can be seen in the right pair, where the phrases "clock tower" and "a bronze statue" in the caption are swapped. These are not isolated examples. As Yuksekgonul et al. [5] pointed out, VLMs "behave like bags-of-words" because they have mostly been pretrained on large-scale web datasets for retrieval tasks, where image and caption matching can often be done using keywords alone.

Figure 1: Example test results of the model's relational understanding. CLIP gives higher similarity scores for unaligned captions. (Left: the aligned caption "Three Jack-O-Lanterns of various shapes, one of which has flowers in it." receives a CLIP score of 0.273, while the unaligned "Flowers of various shapes, one of which has Three Jack-O-Lanterns in it." receives 0.288. Right: the aligned "Clock tower with a bronze statue on top on a sunny day." receives 0.301, while the unaligned "A bronze statue with a clock tower on top on a sunny day." receives 0.306.)

A straightforward and effective solution involves mining hard negative samples for contrastive learning. This entails including negative instances with similar semantic components but distinct relationships in the same batch, challenging the model to discern the correct caption amidst such variations. For example, NegCLIP [5] constructs negative image captions by swapping tokens. However, token swap methods lack semantic understanding, resulting in predictable patterns and a lack of plausibility and fluency. Blind models trained solely on text, without considering images, may exploit such patterns to manipulate evaluations to their advantage [6].

Meaning representations offer an alternative approach to constructing negative samples with greater diversity and fluency. Abstract Meaning Representation (AMR, [7]) stands out as a prevalent semantic representation in text tasks, valued for its high expressiveness and human-friendly comprehensibility; it encodes concepts as nodes and depicts the relationships between concepts through graphical representations. We propose to utilize AMR to create negative samples that possess entirely distinct meanings but share close plain paraphrases. To achieve this, we modify the structure of the meaning representation by randomly shuffling the positions of subtrees within AMR graphs and reconstructing the meaning representations. Negative captions are then generated from the new meaning representations using an AMR generator. We blend our generated negatives with token swap negatives to broaden the diversity of negative samples and enhance generalization. Vision language models then undergo training to distinguish between true labels and negative samples.

Our findings indicate that incorporating negative samples generated from meaning representations improves model performance across diverse compositional understanding benchmarks. Additionally, our generated negatives introduce various patterns, enriching the diversity of augmentations compared to token swap negatives.

2. Related Work

2.1. AMR Data Augmentation

AMR encodes concepts as nodes and illustrates the relationships between these concepts as edges. It has been shown to be advantageous in various natural language processing tasks, such as data augmentation. Token edit data augmentations in NLP often result in ill-formed or incoherent sentences, as they do not consider sentence structure. AMR Data Augmentation (AMR-DA) [8] suggests utilizing AMR for data augmentation. They construct positive samples by meticulously controlling minor nuances within a carefully designed framework for meaning representation. Consequently, they produce several fluent and distinct positive augmentations for the given sentences. Inspired by AMR-DA, we explore the utilization of AMR in compositional understanding tasks for vision language models. However, our approach diverges significantly: rather than focusing on careful modifications to the meaning representation for positive sample generation, we propose employing AMR for negative sample generation. Our methodology involves splitting the meaning representation and shuffling its components to construct a new negative representation.

2.2. Composition-aware Hard Negatives

For generating negative captions for contrastive learning, a straightforward approach involves modifying linguistic elements. To improve compositional understanding, [5] leverage spaCy for syntactic analysis to identify and swap the positions of two elements within the caption. Such token swap modifications, aimed at creating variations in composition, are relatively straightforward to implement but often struggle to maintain grammaticality. Moreover, they can be vulnerable to exploitation, as the patterns of modification may become predictable even without considering information from the image encoder. [9] initially parse the syntactic structure of the caption; they then randomly mask text and utilize a large language model to unmask it and generate a new negative caption. While the resulting caption tends to exhibit improved grammatical correctness, the modification process lacks fine control, and the generated variants remain somewhat constrained in scope. To address the limitations of semantic modification, [10] propose leveraging scene graphs to generate semantic negative captions. They implement a strategy in which they interchange the positions of the subject and object within the same relation, as well as swap the attributes of different objects. However, the modification of scene graphs is limited. Compared to scene graphs, meaning representations encode a more extensive range of relations, especially higher-level abstract semantic relations absent in scene graphs [11]. This suggests that meaning representations have a higher potential to improve downstream tasks that require an understanding of higher-level semantic information in images.

3. Methods

3.1. Extensive Contrastive Learning

The aim of contrastive learning is to bring similar representations into closer proximity while simultaneously pushing apart dissimilar samples. This principle mirrors its application within vision language model training, exemplified by Contrastive Language-Image Pre-Training (CLIP, [1]), which has emerged as a prominent paradigm in vision language learning. The training objective of CLIP is to align text-image pairs effectively. CLIP simultaneously trains an image encoder and a text encoder to extract feature representations from each modality, denoted as 𝐼𝑛 for image features and 𝑇𝑛 for text features. These features are then utilized to compute scaled pairwise cosine similarities, serving as logits.
Finally, a symmetric cross-entropy loss is computed over these similarity scores to guide the training process effectively.

In response to the challenge of vision language models struggling to comprehend text composition, we adopt the approach proposed by Yuksekgonul et al. [5], which introduced two extensive components to standard contrastive learning, aimed at increasing the complexity of model learning. This entails (1) introducing challenging images for the image encoder to extract features from, selected as nearest neighbors of the original images under CLIP encoding, and (2) incorporating hard negative captions for the text encoder to distinguish. The difference is that we add AMR generated negative samples into the hard negative captions, with modifications aimed at preserving most plain text tokens while completely distorting the semantic meaning. Figure 2 illustrates the training pipeline. In each batch, original images 𝐼𝑛 and their nearest neighbors 𝑁𝐼𝑛 are included. Corresponding captions 𝑇𝑛 and 𝑁𝑇𝑛 are concatenated with hard negative captions 𝑇𝑛− and 𝑁𝑇𝑛−, doubling the number of captions relative to the number of images. Subsequently, a symmetric cross-entropy loss is computed as in CLIP. However, only the column-wise loss for positive captions is incorporated, as negative captions lack corresponding images for comparison.

Figure 2: Extensive CLIP for compositional understanding tasks through extensive training with hard neighbor images and AMR generated hard negative captions. (The figure shows a batch of original images and their hard neighbor images, each paired with a positive caption and a hard negative caption, together with the resulting similarity matrix of entries 𝐼𝑛 ⋅ 𝑇𝑛, 𝐼𝑛 ⋅ 𝑁𝑇𝑛, 𝑁𝐼𝑛 ⋅ 𝑇𝑛, and so on.)

3.2. AMR for Negative Sample Generation

In contrast to token swap negative generation, we propose generating negative samples using AMR. AMR encodes semantics into graphs and has demonstrated effectiveness as an intermediate representation in natural language augmentation tasks. We adopt a pipeline similar to AMR-DA [8]: parsing sentences into AMR, modifying the AMR, and generating samples from the modified AMR. However, our objective differs significantly from that of AMR-DA. While they meticulously modify the intermediate AMR to construct positive samples, our task requires generating entirely different semantic representations, albeit with the same semantic components as the given samples.
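As an illustrative sketch (not the authors' released code), the extended objective of Section 3.1 can be written with NumPy: N image embeddings are scored against 2N caption embeddings (N positives followed by N hard negatives), and the caption-to-image direction of the symmetric loss is restricted to the positive columns, since negative captions have no matching image. Function and variable names are our own assumptions.

```python
import numpy as np

def extended_clip_loss(img, txt_pos, txt_neg, temperature=0.07):
    """img: (N, d) image features; txt_pos, txt_neg: (N, d) caption features.
    Hard negative captions widen the logit matrix to (N, 2N); the
    caption-to-image term uses only the N positive captions, since the
    negatives have no matching image."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def cross_entropy(logits):
        # mean cross-entropy with the diagonal entry as the target class
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    img = normalize(img)
    txt = normalize(np.concatenate([txt_pos, txt_neg], axis=0))
    logits = img @ txt.T / temperature         # shape (N, 2N)
    n = img.shape[0]
    loss_i2t = cross_entropy(logits)           # image -> all 2N captions
    loss_t2i = cross_entropy(logits[:, :n].T)  # positive caption -> N images
    return 0.5 * (loss_i2t + loss_t2i)
```

With the hard neighbor images of Figure 2, the same function is simply applied to the doubled batch of images and their captions.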
3.2.1. Meaning Representation

Abstract Meaning Representation (AMR, [7]) is a rooted, directed graph that encodes sentence concepts as nodes and the relations between these concepts as directed edges. In Figure 3, the leftmost portion depicts the AMR graph corresponding to the caption "A truck carries a large amount of items and a few people." In this graph, the root "carry" serves as the primary predicate of the sentence, with "truck" designated as the first argument (denoted ARG0) of "carry", while the subtree originating from "and" represents the second argument. AMR facilitates readability for both human and machine comprehension and can be adapted to various purposes as needed.

Figure 3: Negative example generated based on AMR. The shuffled AMR entails reordering all nodes along with their edges except the root node. (The source caption "A truck carries a large amount of items and a few people." is parsed into an AMR graph; the graph is split and reconstructed; and the AMR generator produces the negative caption "The items are carried by a few large trucks and an amount of people.")
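In the standard PENMAN notation used for AMR, the graph in Figure 3 can be transcribed roughly as follows (an illustrative rendering of the figure; the variable names and simplified concept labels are ours, not taken from the paper's data):

```
(c / carry
   :ARG0 (t / truck)
   :ARG1 (a / and
      :op1 (i / item
         :quant (a2 / amount
            :mod (l / large)))
      :op2 (p / person
         :quant (f / few))))
```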
In this study, our proposal involves splitting the AMR graph, shuffling its components, and then reconstructing a new AMR graph. This process aims to create a hard negative graph in which all semantic parts are retained but the overall meaning is distorted.

3.2.2. Generation Pipeline

The entire pipeline is illustrated in Figure 3. We adopt the AMR-DA pipeline, which involves initially parsing the caption into an AMR graph using an AMR parser. Subsequently, we modify this AMR graph and finally utilize an AMR generator to produce negative captions based on the modified AMR. We utilize the SPRING parser [12] as our AMR parser. SPRING employs a depth-first search method to linearize AMRs and utilizes special tokens <Rn> to manage co-referring nodes. The parser is trained based on the BART model [13]. After obtaining the AMR graph for the caption, we apply a split and reconstruct algorithm to construct a new AMR graph, described in detail in the subsequent paragraphs. Finally, we employ PLMs-Generator [14], based on T5-base, as our AMR generator to convert AMR to text. The model-based generator is tolerant, accommodating certain unreasonable aspects of our modified graphs: it can rectify them to some extent and produce new samples closely resembling the given graph. This flexibility provides greater latitude for modifying the AMR graph compared to rule-based methods. For instance, in Figure 3, although the modified graph contains some illogical elements, such as the node "and" lacking children, the generator is still capable of generating fluent and grammatically correct text.

3.2.3. AMR Split and Reconstruct

The key component of generating negative samples through AMR lies in our split and reconstruct algorithm. Unlike existing methods that rely on token swapping within the sentence or node swapping in the scene graph based on predefined rules, our approach offers greater flexibility by directly modifying the entire meaning representation. Modifications to AMR afford a broader range of possibilities owing to the diverse types of edges and nodes present.

In our algorithm, we split the AMR graph, regarding the root node as a separate entity and treating the other nodes along with their incoming edges as edge-node pairs.
As illustrated in Figure 3, the left-hand side depicts the AMR graph corresponding to the original caption "A truck carries a large amount of items and a few people." Following the split process, we obtain a root node and a collection of edge-node pairs, such as "carry, [(:ARG0, truck), (:ARG1, and), ...]".

Algorithm 1: Negative AMR Generation
Require: G (input AMR graph)
Ensure: Negative_G (output negative AMR graph)
  root_node, list_of_edge_node_pairs ← split_graph(G)          ▷ split the graph
  random.shuffle(list_of_edge_node_pairs)
  Negative_G ← [(root, root_node)]
  Node_stack ← [root_node]
  depth ← 1
  for (edge, node) in list_of_edge_node_pairs do
    choice ← random integer in [1, depth]
    if choice = 1 then                                          ▷ descend to the next level
      Negative_G.append((Node_stack[-1], edge, node))
      Node_stack.append(node)
      depth ← depth + 1
    else                                                        ▷ choice = 2: stay at the current level; choice = n: move back n - 2 levels
      move_back ← choice - 2
      depth ← depth - move_back
      pop Node_stack (move_back + 1) times
      Negative_G.append((Node_stack[-1], edge, node))
      Node_stack.append(node)
    end if
  end for

Next, we reconstruct a semantic tree by randomly concatenating nodes from the split parts. We shuffle the list of edge-node pairs and sequentially select them one by one. The process begins at layer 1 with the root node. At this stage, the first node has only one option: to connect to the root node and move to layer 2. Subsequently, at layer 2, each following node has two options: either to remain at layer 2 by connecting to the root node, or to move to a deeper layer by connecting to the previous node at layer 2. If a node moves to a deeper layer, for instance layer 3, the subsequent node has three options: to remain at the current layer, to move deeper, or to move back to the previous layer. This iterative process continues until all nodes are connected within the semantic tree. In Figure 3, when considering the pair (:mod, large), there are indeed three options available: the node "large" can remain at the current layer by connecting to the node "truck", proceed to a deeper layer by connecting to the node "few", or revert to connect with the root node. The shuffled AMR thus reorders all nodes along with their edges except the root node, resulting in a new representation of meaning. Negative captions are then generated from this shuffled AMR. The reconstruction procedure is given in Algorithm 1.

The distinction between negative AMR generation and AMR-DA lies in their respective objectives. AMR-DA aims to regulate modifications to avoid distorting the overall semantic meaning of the sentence by selectively adding or removing nuanced semantic components. Negative AMR generation, on the other hand, focuses on retaining the majority of the semantic components while generating entirely different semantic representations.
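A minimal runnable sketch of the split and reconstruct procedure (Algorithm 1); the triple-based graph encoding and the helper names are our illustrative choices, not the authors' implementation:

```python
import random

def split_graph(triples):
    """Split an AMR graph (a list of (parent, edge, node) triples, root
    triple first) into its root node and a list of (edge, node) pairs."""
    root = triples[0][2]  # the (None, ':root', node) triple
    pairs = [(edge, node) for _, edge, node in triples[1:]]
    return root, pairs

def negative_amr(triples, seed=None):
    """Reconstruct a shuffled 'negative' AMR: every non-root node is
    reattached at a randomly chosen depth, as in Algorithm 1."""
    rng = random.Random(seed)
    root, pairs = split_graph(triples)
    rng.shuffle(pairs)
    negative = [(None, ':root', root)]
    stack = [root]  # path from the root to the current attachment point
    depth = 1
    for edge, node in pairs:
        choice = rng.randint(1, depth)
        if choice == 1:  # descend one level below the current node
            negative.append((stack[-1], edge, node))
            stack.append(node)
            depth += 1
        else:            # choice = 2: stay at this level; choice = n: go up n - 2 levels
            move_back = choice - 2
            depth -= move_back
            for _ in range(move_back + 1):
                stack.pop()
            negative.append((stack[-1], edge, node))
            stack.append(node)
    return negative

# Example: "A truck carries a large amount of items and a few people."
graph = [
    (None,     ':root',  'carry'),
    ('carry',  ':ARG0',  'truck'),
    ('carry',  ':ARG1',  'and'),
    ('and',    ':op1',   'item'),
    ('item',   ':quant', 'amount'),
    ('amount', ':mod',   'large'),
    ('and',    ':op2',   'person'),
    ('person', ':quant', 'few'),
]
shuffled = negative_amr(graph, seed=0)
```

The original parents are deliberately discarded: only the root, the edges, and the nodes survive the split, so every semantic component is retained while the attachment structure is randomized.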
4. Experiments

We conduct experiments on different evaluation datasets to explore the impact of AMR generated negatives on the performance of vision language models in compositional understanding tasks.

4.1. Experimental Settings

We explore whether AMR generated negatives improve models' compositional understanding, so we follow the training setup of NegCLIP [5], which finetunes CLIP (ViT-B/32¹) on the COCO dataset with token swap hard negatives.

For negative captions, we assign a specific probability of replacing the original token swap caption with an AMR generated negative augmentation. In the main results, the probability of replacing negatives in NegCLIP is set at 30%.
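This replacement step can be sketched as follows (an illustrative snippet under our naming assumptions, with the token swap and AMR negatives precomputed):

```python
import random

def mix_negatives(token_swap_negs, amr_negs, replace_ratio=0.3, seed=None):
    """For each caption, use the AMR generated negative with probability
    `replace_ratio`; otherwise keep the token swap negative (NegCLIP style)."""
    rng = random.Random(seed)
    return [amr if rng.random() < replace_ratio else tok
            for tok, amr in zip(token_swap_negs, amr_negs)]

negs = mix_negatives(["tok1", "tok2", "tok3"], ["amr1", "amr2", "amr3"],
                     replace_ratio=0.3, seed=0)
```

Sampling per caption (rather than splitting the dataset once) keeps both kinds of negatives spread across batches.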
In other words, about 30% of the captions are paired with our AMR generated hard negative captions, while the remainder keep the original token swap negative samples; both kinds are used for contrastive training. This approach ensures that a diverse range of negatives is maintained. A comparison of different probabilities is included in Section 5.3. For each image, one of the three nearest neighbors, determined by CLIP encoding, is sampled as the hard image.

NegCLIP initially sets the batch size to 1024. However, due to device limitations, we are constrained to train the model on a single NVIDIA RTX 2080 Ti GPU, reducing our batch size to 32. Consequently, we adjust the warm-up steps to 1600. Contrastive learning relies on batch size, as it involves contrasting samples within each batch; therefore, larger batch sizes are anticipated to yield greater improvements. We employ the AdamW optimizer with a cosine annealing schedule and train for 5 epochs. The learning rate is explored within the range {1e-5, 5e-6, 1e-6}, with reported results using 5e-6.

¹ https://github.com/openai/CLIP

4.2. Evaluation Dataset

We assess the efficacy of our approach on two widely used benchmarks for compositional understanding: ARO [5] and SugarCrepe [6]. ARO stands for Attribution, Relation, and Order, and includes four tasks. The Visual Genome Relation (VG-Relation) and Visual Genome Attribution (VG-Attribution) tasks entail selecting the correct caption from two options, where the negative caption alters either the object of the relation or the object's attribution. The Flickr30k Order and COCO Order tasks require models to identify the correctly ordered caption among five options, where the negative captions modify the order of tokens within the caption. SugarCrepe aims to address the issue of negative captions being implausible and non-fluent by employing large language models to generate fluent and challenging negative captions. The dataset encompasses three tasks: Replace, Swap, and Add, which entail various actions aimed at evaluating models' compositional understanding.

4.3. Main Results

We incorporate AMR generated negative samples into our contrastive training data and refer to our method as AMR-NegCLIP. In this study, we undertake a comparative analysis of the outcomes of our AMR-NegCLIP approach against several baseline models: ViT-B-32, standard CLIP finetuned on the COCO dataset (CLIP), and CLIP finetuned with token-level hard negatives (NegCLIP).

Table 1
ARO and SugarCrepe results comparison of AMR-NegCLIP with different models.

              ARO                                                      SugarCrepe (all datasets avg)
              VG-Relation  VG-Attribution  Flickr30k Order  COCO Order  Replace  Swap  Add
ViT-B-32      51.1         61.3            47.2             37.1        80.8     63.3  75.1
CLIP          59.9         63.2            59.5             46.0        84.8     70.8  85.6
NegCLIP       81.0         71.0            91.0             86.0        85.4     75.3  87.3
AMR-NegCLIP   83.2         75.6            93.9             91.6        86.4     81.2  87.5

From Table 1, we find that AMR-NegCLIP achieves superior performance across all subtasks. On the Visual Genome datasets, AMR-NegCLIP obtains a 2.2% improvement over NegCLIP in the Relation task and a 4.6% improvement in the Attribution task. On the Flickr30k Order dataset, there is a 2.9% improvement over NegCLIP and a substantial 34.4% improvement over CLIP. On the COCO Order dataset, there is a 5.6% improvement over NegCLIP and an impressive 45.6% improvement over CLIP. In the Replace and Add tasks within SugarCrepe, AMR-NegCLIP exhibits limited improvements over NegCLIP, with 1.0% in the Replace task and 0.2% in the Add task. This discrepancy can be attributed to the nature of the Replace and Add tasks, which involve modifying concepts within the caption; AMR-NegCLIP generates negatives that maintain the same concepts as the positive caption, thereby not entirely aligning with the task requirements. In contrast, another notable observation is a significant improvement, 5.9% over NegCLIP, in the Swap task of SugarCrepe, a challenge that proves particularly daunting for pretrained CLIP models, as highlighted in the SugarCrepe paper [6]. In their study, the SugarCrepe authors evaluate over ten vision language models and note that "all models struggle at identifying SWAP hard negatives, regardless of their pretraining dataset and model size." This difficulty arises from the nature of the swap action in SugarCrepe, which involves neither adding nor excluding any concepts, but rather swapping objects or attributes while maintaining fluency and grammatical correctness, a task demanding a deeper understanding of composition from vision language models. This closely aligns with our motivation to employ meaning representations in negative generation. Example evaluation data for ARO and SugarCrepe are provided in Table 2.

Table 2
Example evaluation data of Visual Genome Relation and Flickr30k Order in ARO, and Replace, Swap, and Add in SugarCrepe. The caption marked (positive) is the positive caption for the sample, while the other lines contain negative captions. Visual Genome Relation includes two captions per sample, whereas the Order test sets include five captions per sample.

Visual Genome Relation:
  the door is to the left of the shirt. (positive)
  the shirt is to the left of the door.
Flickr30k Order:
  A group of people standing on the lawn in front of a building. (positive)
  Many people in blue jeans stand in front of a white church.
  A large group of people stand outside of a church.
  Family members standing outside a home.
  People standing outside of a building.
SugarCrepe Replace:
  A tan toilet and sink combination in a small room. (positive)
  A white toilet and sink combination in a small room.
SugarCrepe Swap:
  Three large horses eating hay while a small horse stands behind. (positive)
  A small horse eating hay while three large horses stand behind.
SugarCrepe Add:
  Two zebras are battling each other on hind legs. (positive)
  Two striped-and-spotted zebras are battling each other on hind legs.
In the Order evaluation datasets, negative samples exhibit greater diversity. The introduction of Swap in SugarCrepe aims to rectify instances of textual non-fluency and implausibility, thereby rendering the benchmark more resilient against potential hacking attempts from blind models.

In conclusion, the results indicate that integrating AMR generated negative captions significantly improves VLMs' performance on various composition tasks, especially those involving high-level compositional understanding.

5. Analysis

5.1. Comparison with Scene Graph

Understanding the meaning of images has long been a goal. Scene graphs have emerged as a popular method for encoding objects, their attributes, and their relationships within graphs. Abdelsalam et al. [11] discuss the difference between AMR and scene graphs through a detailed statistical analysis of entity and relation categorization. Their conclusion highlights that AMR encodes a broader range of relationships, particularly abstract semantic relationships absent in scene graphs.

Some studies have also explored leveraging scene graphs to construct negative samples, particularly focusing on token swapping, such as swapping asymmetric relations [15, 10, 5].

Table 3
Negative sentences generated using Random Token Swap, Scene Graph Node Swap, and AMR Reconstruction.

Source:                A truck carries a large amount of items and a few people.
Random Token Swap:     A amount carries a large truck of items and a few people.
Scene Graph Node Swap: A truck carries a few amount of items and a large people.
AMR Reconstruction:    The items are carried by a few large trucks and an amount of people.

Source:                A pigeon greets three bicyclists on a park path.
Random Token Swap:     A park greets three bicyclists on a pigeon path.
Scene Graph Node Swap: A bicyclist greets three pigeon on a park path.
AMR Reconstruction:    Greetings, three pigeon bicyclers on the path have been parkled.

Source:                People walking pass a horse drawn carriage sitting at the curb.
Random Token Swap:     People walking pass a horse drawn curb sitting at the carriage.
Scene Graph Node Swap: People sitting at a horse drawn carriage walking pass the curb.
AMR Reconstruction:    People walking by the curb, horse sitting, carriage pulling.
These methods have produced limited variants. Our approach, by contrast, addresses the entire semantic representation rather than specific token swaps. To analyze the difference between the outputs, we present negative samples generated by Random Token Swap, Scene Graph Node Swap, and AMR Reconstruction in Table 3.

In contrast to the Random Token Swap approach, leveraging scene graphs yields a richer array of syntactic and semantic cues. However, the generated negatives adhere to rule-based criteria, such as swapping exclusively between adjective words or between words sharing a common relational structure.

Table 4
ARO performance comparison of different strategies. †: results from [10], applying the semantic negative strategy; ‡: results from [15], incorporating scene graph prediction in training.

Table 5
Comparison of ARO performance before and after replacing a portion of the original negative samples with AMR generated negative samples.
(Table 4)
                    Visual Genome                Flickr30k  COCO
                    Relation    Attribution      Order      Order
CLIP                59.9        63.2             59.5       46.0
NegCLIP             81.0        71.0             91.0       86.0
AMR-NegCLIP         83.2        75.6             93.9       91.6
Semantic Negative†  79.0        77.8             -          -
CLIP-SGVL‡          -           -                82.0       78.2

(Table 5)
                    Visual Genome                Flickr30k  COCO
                    Relation    Attribution      Order      Order    Average
CLIP                59.9        63.2             59.5       46.0     57.2
NegCLIP             81.0        71.0             91.0       86.0     82.3
Replace ratio:
10%                 83.4        74.4             94.1       92.1     86.0
20%                 82.6        76.0             92.9       90.3     85.4
30%                 83.2        75.6             93.9       91.6     86.1
40%                 83.8        74.8             91.3       88.3     84.5
50%                 82.6        74.3             94.0       90.6     85.4
60%                 81.2        75.1             91.5       87.6     83.9
70%                 80.3        71.9             93.7       91.8     84.4
80%                 80.2        71.2             93.2       91.5     84.0
90%                 78.4        71.3             89.3       86.4     81.4
100%                75.0        69.4             83.4       80.9     77.2

It is evident that AMR Reconstruction introduces a wider spectrum of variations to the original captions, all while upholding the core semantic components. Our methodology thus offers enhanced flexibility in generating negative training data.

Furthermore, in Table 4 we compare AMR-NegCLIP with other negative augmentation-based methods: Semantic Negative [10], which constructs negative samples using scene graph node swaps, and CLIP-SGVL [15], which utilizes scene graphs in multiple ways, including positive and negative caption generation as well as a scene graph prediction task. Although the training and validation datasets of Semantic Negative differ from ours, it can still be seen that it is challenging to improve the accuracy of both relations and attributes by changing the negative samples. The findings indicate that AMR-NegCLIP achieves superior average performance in comparison to the Semantic Negative method.
This observation underscores the efficacy of employing AMR generated negatives, which yield more pronounced enhancements than the strategy of swapping scene graph nodes. The negative sample generation rules in CLIP-SGVL are similar to those of Semantic Negative; our AMR-NegCLIP demonstrates superior performance in the Order tasks, with more varied negatives.

5.2. Case Study

We present several case studies illustrating the results of CLIP and AMR-NegCLIP across four subtasks in SugarCrepe, as depicted in Figure 4. SugarCrepe utilizes large language models to generate captions with a high degree of fluency and commonsense plausibility, thereby posing a challenge for VLMs to discern negative captions effectively. For instance, in the Swap Object task, VLMs must comprehend the semantics of relationships such as "in" and "background", as well as discern the object and subject of these relationships. Our test results demonstrate that while CLIP exhibits closely aligned similarity scores between captions and negative captions, AMR-NegCLIP demonstrates superior discriminatory capability. Furthermore, in the Swap Attribution task, models are required to accurately identify quantities and the positions of the corresponding objects to succeed. CLIP returns nearly identical scores and struggles to differentiate between the captions, whereas AMR-NegCLIP excels in selecting the correct option. Examples of the Replace Relationship and Replace Attribution tasks highlight instances where CLIP struggles to discern subtle yet crucial concept replacements. These nuances have been effectively addressed through negative caption contrastive learning.

[Figure 4 appears here. For example, in a Swap Object case, CLIP scores the positive caption "A city street with a rainbow in the background." at 0.313 versus 0.316 for the swapped negative, while AMR-NegCLIP scores them 0.391 versus 0.269.]

5.3. Performance Impact Analysis of AMR Generated Negative Sample Ratios

AMR generated negative samples tend to distort the entire semantic representation of a given caption, while NegCLIP swaps the positions of tokens. The two kinds of generated negatives therefore address different levels, from individual objects to complete semantics. To ensure the augmented data spans these different levels in the training dataset, we retain part of the negative samples from NegCLIP while replacing a ratio of them with AMR generated negative samples.

To assess the impact of AMR generated negative samples on model performance, we replace NegCLIP negatives at ratios ranging from 10% to 100% and present the results in Table 5. When replacing only 10% of the NegCLIP negatives with AMR generated negative samples, model performance exhibits noticeable improvements, particularly 6.1% on the COCO Order subtask. The best performance is achieved when 30% of the token swap negatives are replaced by AMR generated negatives. Across replacement ratios from 10% to 60%, the integration of AMR generated negatives yields improvements over NegCLIP across all subtasks. These enhancements are consistently observed, with average performance gains ranging from 1.6% to 3.8%. Beyond a 70% replacement ratio, larger ratios result in decreased model performance.
Figure 4: Predictions of CLIP and AMR-NegCLIP on SugarCrepe tasks: Swap Object, Swap Attribution, Replace Relationship and Replace Attribution. The score represents the similarity score between the (Negative) caption and the corresponding image as assessed by CLIP/AMR-NegCLIP. The model selects the caption with the higher similarity score as the correct one. AMR generated, the performance is inferior to that of to- high-level comprehension. Furthermore, beyond simple ken swap negatives but still superior to CLIP. The reason shuffling, AMR offers the potential for more controlled for this phenomenon could be attributed to the greater modifications based on human instructions. For instance, diversity of AMR generated negatives compared to to- users could add semantic components that are absent in ken swap negatives. Unlike token swap negatives, which the picture to deliberately confuse VLMs. We view this follow a unified pattern, AMR generated negatives lack as a promising avenue for future research. such consistency, making it challenging for models to effectively learn from them, particularly when the re- Limitaions Conducting AMR parsing and generation placement ratio is high. Therefore, we propose that our typically requires GPU acceleration, which incurs higher AMR generated negative captions can effectively com- costs compared to direct token shuffling methods. How- plement token swap generations. ever, when compared to tasks such as scene graph parsing or querying large language models, it remains an efficient approach. It’s worth noting that splitting and shuffling 6. Conclusion AMR components introduce significant randomness in To overcome the limitations of vision language models negative generation, and occasionally, this may lead to in comprehending composition and semantics, we sug- suboptimal results. gest constructing hard negative samples through splitting and reconstructing AMR graphs. 
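As a concrete illustration of the replacement procedure studied in Section 5.3, the sketch below mixes token swap negatives with AMR generated negatives at a fixed ratio. This is a minimal sketch under an assumed data layout (parallel lists of per-caption negatives); the function and variable names are hypothetical, not taken from any released code.

```python
import random

def mix_negatives(token_swap_negs, amr_negs, replace_ratio, seed=0):
    """Replace a fixed ratio of token swap negatives with AMR generated ones.

    `token_swap_negs` and `amr_negs` are parallel lists: entry i holds the
    negative caption produced for training caption i by each method.
    All names here are illustrative, not the paper's actual code.
    """
    assert len(token_swap_negs) == len(amr_negs)
    rng = random.Random(seed)
    n = len(token_swap_negs)
    n_replace = int(n * replace_ratio)
    # Pick which training captions switch to an AMR generated negative.
    replace_idx = set(rng.sample(range(n), n_replace))
    return [amr_negs[i] if i in replace_idx else token_swap_negs[i]
            for i in range(n)]

# At the best-performing setting in Table 5 (30%), 3 of these 10
# captions end up paired with an AMR generated negative.
mixed = mix_negatives(
    ["ts0", "ts1", "ts2", "ts3", "ts4", "ts5", "ts6", "ts7", "ts8", "ts9"],
    ["amr0", "amr1", "amr2", "amr3", "amr4",
     "amr5", "amr6", "amr7", "amr8", "amr9"],
    replace_ratio=0.3)
```

The remaining 70% of captions keep their token swap negative, so the contrastive batches still contain both kinds of hard negatives.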
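The split-and-shuffle operation at the core of our negative generation can be illustrated on a toy data structure. This sketch is only schematic: actual AMR handling would go through an AMR parser and graph-to-text generator (e.g. a Penman-notation toolkit), and the node encoding assumed here (a concept plus role-labelled children) is a deliberate simplification.

```python
import random

# Toy AMR for "A couple is sitting on a statue of a horse":
# a node is (concept, [(role, child_node), ...]).
amr = ("sit-01",
       [(":ARG1", ("couple", [])),
        (":location", ("statue", [(":mod", ("horse", []))]))])

def shuffle_subtrees(node, rng):
    """Detach the subtrees under each node and randomly reattach them to
    the node's roles, recursing top-down. Swapping the :ARG1 and :location
    fillers above, for example, would mean "a statue sitting on a couple":
    a fluent caption with an entirely different meaning, i.e. a hard
    negative."""
    concept, edges = node
    roles = [role for role, _ in edges]
    children = [shuffle_subtrees(child, rng) for _, child in edges]
    rng.shuffle(children)
    return (concept, list(zip(roles, children)))

negative_amr = shuffle_subtrees(amr, random.Random(1))
```

A graph-to-text AMR generator conditioned on the shuffled graph then realizes the negative caption; because the reattachment is random, the resulting graphs vary far more than token swaps, at the cost of the occasional degenerate output noted in the Limitations.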
References

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[2] J. Li, D. Li, C. Xiong, S. Hoi, Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International Conference on Machine Learning, PMLR, 2022, pp. 12888–12900.
[3] T. Zhao, T. Zhang, M. Zhu, H. Shen, K. Lee, X. Lu, J. Yin, Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations, arXiv preprint arXiv:2207.00221 (2022).
[4] T. Thrush, R. Jiang, M. Bartolo, A. Singh, A. Williams, D. Kiela, C. Ross, Winoground: Probing vision and language models for visio-linguistic compositionality, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5238–5248.
[5] M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou, When and why vision-language models behave like bags-of-words, and what to do about it?, in: The Eleventh International Conference on Learning Representations, 2022.
[6] C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, R. Krishna, Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality, Advances in Neural Information Processing Systems 36 (2024).
[7] L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider, Abstract meaning representation for sembanking, in: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 2013, pp. 178–186.
[8] Z. Shou, Y. Jiang, F. Lin, Amr-da: Data augmentation by abstract meaning representation, in: Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 3082–3098.
[9] S. Doveh, A. Arbelle, S. Harary, E. Schwartz, R. Herzig, R. Giryes, R. Feris, R. Panda, S. Ullman, L. Karlinsky, Teaching structured vision & language concepts to vision & language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2657–2668.
[10] Y. Huang, J. Tang, Z. Chen, R. Zhang, X. Zhang, W. Chen, Z. Zhao, T. Lv, Z. Hu, W. Zhang, Structure-clip: Enhance multi-modal language representations with structure knowledge, arXiv preprint arXiv:2305.06152 (2023).
[11] M. A. Abdelsalam, Z. Shi, F. Fancellu, K. Basioti, D. J. Bhatt, V. Pavlovic, A. Fazly, Visual semantic parsing: From images to abstract meaning representation, arXiv preprint arXiv:2210.14862 (2022).
[12] M. Bevilacqua, R. Blloshmi, R. Navigli, One spring to rule them both: Symmetric amr semantic parsing and generation without a complex pipeline, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 12564–12573.
[13] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
[14] L. F. Ribeiro, M. Schmitt, H. Schütze, I. Gurevych, Investigating pretrained language models for graph-to-text generation, in: Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, 2021, pp. 211–227.
[15] R. Herzig, A. Mendelson, L. Karlinsky, A. Arbelle, R. Feris, T. Darrell, A. Globerson, Incorporating structured representations into pretrained vision & language models using scene graphs, arXiv preprint arXiv:2305.06343 (2023).