<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing Semantic Understanding in Vision Language Models Using Meaning Representation Negative Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ziyi Shou</string-name>
          <email>zshou@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fangzhen Lin</string-name>
          <email>flin@cse.ust.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>HKUST-Xiaoi Joint Laboratory</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>30th ACM KDD Conference</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hong Kong University of Science and Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Vision language models have been criticized for their performance resembling bag-of-words models, lacking semantic understanding. Efforts to address this concern have included the integration of composition-aware negative samples into contrastive learning methodologies. However, current negative generation methods show restricted semantic comprehension, diversity, and fluency. To tackle this issue, we propose leveraging Abstract Meaning Representation (AMR), a representation of considerable interest in natural language processing research, for negative sample generation. By altering the structure of the meaning representation, we create negative samples with entirely different meanings that nevertheless share close plain paraphrases. These negatives, generated using AMR, are then incorporated alongside token swap negatives during contrastive training. Our results indicate that AMR generated negatives introduce significantly diverse patterns. Furthermore, the inclusion of AMR generated negative samples enhances the models' performance across a range of compositional understanding tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Negative</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <p>In recent years, the conspicuous development of vision language models (VLMs) across various tasks is evident. However, these models have been criticized for performing akin to bag-of-words models, lacking semantic understanding, especially compositional understanding [4, 3, 5]. For instance, when some tokens in the caption of an image-caption pair are rearranged, the resulting sentence can carry an entirely different meaning, yet the model may still score it highly against the image.</p>
      <p>Consider the two image-caption pairs in Figure 1. In the left pair, the phrases "Three Jack-O-Lanterns" and "flowers" in its caption are swapped, resulting in a semantically very different sentence. But CLIP fails to notice the difference and somehow gives the modified caption a slightly higher similarity score. A similar effect can be seen in the right image-caption pair, when the phrases "Clock tower" and "a bronze statue" in its caption are swapped. These are not isolated examples.</p>
    </sec>
    <sec id="sec-3">
      <p>As Yuksekgonul et al. [5] pointed out, VLMs "behave like bags-of-words" because they have been mostly pretrained on large-scale web datasets for retrieval tasks, where image and caption matching can often be done using keywords alone.</p>
    </sec>
    <sec id="sec-4">
      <p>A straightforward and effective solution involves mining hard negative samples for contrastive learning. This entails including negative instances with similar semantic components but distinct relationships in the same batch, challenging the model to discern the correct caption amidst such variations. For example, NegCLIP [5] constructs negative image captions by swapping tokens.</p>
      <p>KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference</p>
    </sec>
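      <p>As a concrete illustration of the token swap strategy just described, the following minimal Python sketch swaps two word positions in a caption. It is only illustrative: NegCLIP selects syntactic constituents with spaCy rather than arbitrary word positions.</p>
      <preformat>
```python
import random

def token_swap_negative(caption, seed=0):
    # Illustrative sketch only: NegCLIP swaps syntactic constituents
    # identified with spaCy; here we simply swap two random distinct
    # word positions to produce a hard negative caption.
    tokens = caption.split()
    if len(tokens) > 1:
        rng = random.Random(seed)
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

caption = "Three Jack-O-Lanterns of various shapes one of which has flowers in it"
print(token_swap_negative(caption))
```
      </preformat>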
    <sec id="sec-5">
      <p>However, token swap methods lack semantic understanding, diversity, and fluency. Blind models trained solely on text, without considering images, may manipulate such evaluations to their advantage [6].</p>
      <p>Meaning representations offer an alternative approach with better semantic control and fluency. Abstract Meaning Representation (AMR, [7]) stands out as a prevalent semantic representation in text tasks and is valued for its high expressiveness and human-friendly comprehension: it encodes concepts as nodes and depicts the relationships between concepts through graphical representations. We propose to utilize AMR to create negative samples that possess entirely distinct meanings but share close plain paraphrases. To achieve this, we modify the structure of the meaning representation by randomly shuffling the positions of subtrees within AMR graphs and reconstructing the meaning representations. Following this process, negative captions are generated from the new meaning representations using an AMR generator. We blend our generated negatives with token swap negatives to broaden the diversity of negative samples and enhance generalization. Subsequently, vision language models undergo training to distinguish between true labels and negative samples.</p>
    </sec>
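      <p>The blending of AMR generated negatives with token swap negatives can be sketched as follows; the function name, the placeholder strings, and the default replacement probability are illustrative assumptions, not the paper's implementation.</p>
      <preformat>
```python
import random

def sample_hard_negative(token_swap_neg, amr_neg, p_amr=0.3, rng=None):
    # With probability p_amr the AMR-generated negative replaces the
    # token swap negative; otherwise the token swap negative is kept,
    # so both augmentation patterns appear in training batches.
    rng = rng if rng is not None else random.Random()
    if p_amr > rng.random():
        return amr_neg
    return token_swap_neg

rng = random.Random(0)
picks = [sample_hard_negative("swap", "amr", 0.3, rng) for _ in range(1000)]
print(picks.count("amr"))  # roughly 300 of 1000
```
      </preformat>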
    <sec id="sec-6">
      <p>Our findings indicate that incorporating negative samples generated from meaning representations improves model performance across diverse compositional understanding benchmarks. Additionally, our generated negatives introduce various patterns, enriching the diversity of augmentations compared to token swap negatives.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[Figure 1: Aligned and unaligned image-caption pairs with CLIP similarity scores. Left image: "Three Jack-O-Lanterns of various shapes, one of which has flowers in it." (aligned, CLIP score 0.273) versus "Flowers of various shapes, one of which has Three Jack-O-Lanterns in it." (unaligned, CLIP score 0.288). Right image: "Clock tower with a bronze statue on top on a sunny day." (aligned, CLIP score 0.301) versus "A bronze statue with a clock tower on top on a sunny day." (unaligned, CLIP score 0.306).]</p>
      <sec id="sec-6-1">
        <title>2. Related Work</title>
        <sec id="sec-6-2">
          <title>2.1. AMR Data Augmentation</title>
          <p>AMR encodes concepts as nodes and illustrates the relationships between these concepts as edges. It has been shown to be advantageous in various natural language processing tasks, such as data augmentation. Token edit data augmentations in NLP often result in ill-formed or incoherent sentences, as they do not consider sentence structures. AMR Data Augmentation (AMR-DA) [8] suggests utilizing AMR for data augmentation: they construct positive samples by meticulously controlling minor nuances within a carefully designed framework for meaning representation, and consequently produce several fluent and distinct positive augmentations for the given sentences. Inspired by AMR-DA, we explore the utilization of AMR in compositional understanding tasks for vision language models. However, our approach diverges significantly; rather than focusing on careful modifications to the meaning representation for positive sample generation, we propose employing AMR for negative sample generation. Our methodology involves splitting the meaning representation and shuffling its components to construct a new negative representation.</p>
        </sec>
        <sec id="sec-6-3">
          <title>2.2. Composition-aware Hard Negatives</title>
          <p>For generating negative captions for contrastive learning, a straightforward approach involves modifying linguistic elements. To improve compositional understanding, [5] leverage spaCy for syntactic analysis to identify and swap the positions of two elements within the caption. The token swap modifications aimed at creating variations in composition are relatively straightforward to implement but often struggle to maintain grammaticality. Moreover, they can be vulnerable to exploitation, as the patterns of modification may become predictable even without considering information from the image encoder. [9] initially parse the syntactic structure of the caption; they then randomly mask text and utilize a large language model to unmask it and generate a new negative caption. While the resulting caption tends to exhibit improved grammatical correctness, the modification process lacks fine control, and the generated variants remain somewhat constrained in scope. To address the limitations of semantic modification, [10] propose leveraging scene graphs to generate semantic negative captions. They implement a strategy where they interchange the positions of the subject and object within the same relation, as well as swap the attributes of different objects. However, the modification of scene graphs is limited. Compared to scene graphs, meaning representations encode a more extensive range of relations, especially higher-level abstract semantic relations absent in scene graphs [11]. This suggests that meaning representations have a higher potential to improve downstream tasks that require an understanding of higher-level semantic information in images.</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>3. Methods</title>
        <sec id="sec-6-5">
          <title>3.1. Extensive Contrastive Learning</title>
          <p>The aim of contrastive learning is to bring similar representations into closer proximity while simultaneously pushing apart dissimilar samples. This principle mirrors its application within vision language model training, exemplified by Contrastive Language-Image Pre-Training (CLIP, [1]), which has emerged as a prominent paradigm in vision language learning. The training objective of CLIP is to align text-image pairs effectively. CLIP simultaneously trains an image encoder and a text encoder to extract feature representations from each modality. These features are then utilized to compute scaled pairwise cosine similarities, serving as logits. Finally, a symmetric cross-entropy loss is computed over these similarity scores to guide the training process.</p>
          <p>[Figure 2: The training pipeline. Original images and hard images (nearest neighbors) are passed through the image encoder; their captions and hard negative captions are passed through the text encoder, and pairwise similarities are contrasted. Example caption for an original image: "A small child wearing headphones plays on the computer.", with negative caption "Children's headphones are small enough to wear while using the computer to play."; example caption for a hard image: "A young professional is working at his laptop while his coworker is reading material.", with negative caption "When I was a young reader, my professional work was on a laptop with a co-worker."]</p>
          <p>In response to the challenge of vision language models struggling to comprehend text composition, we adopt the approach proposed by Yuksekgonul et al. [5], which introduced two extensive components to standard contrastive learning, aimed at increasing the complexity of model learning. This entails (1) introducing challenging images, selected using CLIP encoding as nearest neighbors of the original images, for the image encoder to extract features from, and (2) incorporating hard negative captions for the text encoder to distinguish. The difference is that we add AMR generated negative samples into the hard negative captions, with modifications aimed at preserving most plain text tokens while completely distorting the semantic meaning. Figure 2 illustrates the training pipeline. In each batch, the original images and their nearest neighbors are included. The corresponding captions are concatenated with hard negative captions, doubling the number of captions compared to the number of images. Subsequently, a symmetric cross-entropy loss is computed as in CLIP. However, only the column-wise loss for positive captions is incorporated, as negative captions lack corresponding images for comparison.</p>
        </sec>
        <sec id="sec-6-6">
          <title>3.2. AMR for Negative Sample Generation</title>
          <p>In contrast to token swap negative generation, we propose the generation of negative samples using AMR. AMR encodes semantics into graphs and has demonstrated effectiveness as an intermediate representation in natural language augmentation tasks. We adopt a similar pipeline to AMR-DA [8]: parsing sentences into AMR, modifying the AMR, and generating samples from the modified AMR. However, our objective differs significantly from that of AMR-DA. While they meticulously modify the intermediate AMR to construct positive samples, our task requires generating entirely different semantic representations, albeit with the same semantic components as the given samples.</p>
          <sec id="sec-6-7">
            <title>3.2.1. Meaning Representation</title>
            <p>Abstract Meaning Representation (AMR, [7]) is a rooted, directed graph that encodes sentence concepts as nodes and the relations between these concepts as directed edges. In Figure 3, the leftmost portion depicts the AMR graph corresponding to the caption "A truck carries a large amount of items and a few people." In this graph, the root "carry" serves as the primary predicate of the sentence, with "truck" designated as the first argument (denoted as ARG0) of "carry," while the subtree originating from "and" represents the second argument. AMR facilitates readability for both human and machine comprehension and can be adapted to various purposes as needed. In this study, our proposal involves splitting the AMR graph, shuffling its components, and then reconstructing a new AMR graph. This process aims to create a hard negative graph where all semantic parts are retained, but the overall meaning is distorted.</p>
            <p>[Figure 3: The generation pipeline for the caption "A truck carries a large amount of items and a few people." AMR parsing yields a graph rooted at "carry", with :ARG0 "truck" and :ARG1 "and", whose :OP1 "item" has a "large" "amount" and whose :OP2 "person" is quantified by "few". The graph is split, shuffled, and reconstructed into a negative AMR, from which a negative caption is generated.]</p>
          </sec>
          <sec id="sec-6-8">
            <title>3.2.2. Generation Pipeline</title>
            <p>The entire pipeline is illustrated in Figure 3. We adopt the AMR-DA pipeline, which involves initially parsing the caption into an AMR graph using an AMR parser; subsequently, we modify this AMR graph and finally utilize an AMR generator to produce negative captions based on the modified AMR. We utilize the SPRING parser [12] as our AMR parser. SPRING employs a depth-first search method to linearize AMRs and utilizes special tokens to manage co-referring nodes; the parser is trained based on the BART model [13]. After obtaining the AMR graph for the caption, we apply a split and reconstruct algorithm to construct a new AMR graph, which is described in detail in the subsequent paragraphs. Finally, we employ PLMs-Generator [14] based on T5-base as our AMR generator to convert AMR to text. The model-based generator exhibits tolerance, allowing for the accommodation of certain unreasonable aspects within our modified graph. The AMR generator can rectify the graph to some extent and produce new samples closely resembling the given graph; this flexibility provides greater latitude for modifying the AMR graph compared to rule-based methods. For instance, in Figure 3, although the modified graph contains some illogical elements, such as the node "and" lacking children, the generator is still capable of generating fluent and grammatically correct text.</p>
          </sec>
          <sec id="sec-6-9">
            <title>3.2.3. AMR Split and Reconstruct</title>
            <p>The key component of generating negative samples through AMR lies in our split and reconstruct algorithm. Unlike existing methods that rely on token swapping within the sentence or node swapping in the scene graph based on predefined rules, our approach offers greater flexibility by directly modifying the entire meaning representation. Modifications to AMR afford a broader range of possibilities owing to the diverse types of edges and nodes present.</p>
            <p>In our algorithm, we split the AMR graph and regard the root node as a separate entity, while treating other nodes along with their incoming edges as edge-node pairs. As illustrated in Figure 3, the left-hand side depicts the AMR graph corresponding to the original caption "A truck carries a large amount of items and a few people." Following the split process, we obtain a root node and a collection of edge-node pairs, such as "carry, [(:ARG0, truck), (:ARG1, and), ...]".</p>
          </sec>
        </sec>
      </sec>
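      <p>The split step just described can be sketched in plain Python; the nested-tuple encoding of the AMR graph below is an illustrative stand-in for a real AMR data structure, not the paper's implementation.</p>
      <preformat>
```python
import random

# An AMR graph as nested (concept, [(edge, subgraph), ...]) pairs, for the
# caption "A truck carries a large amount of items and a few people."
amr = ("carry", [
    (":ARG0", ("truck", [])),
    (":ARG1", ("and", [
        (":op1", ("item", [(":quant", ("amount", [(":mod", ("large", []))]))])),
        (":op2", ("person", [(":quant", ("few", []))])),
    ])),
])

def split_graph(graph):
    # Split an AMR graph into its root concept and a flat list of
    # (edge, concept) pairs, discarding the original tree shape.
    root, children = graph
    pairs = []
    stack = list(children)
    while stack:
        edge, (concept, sub) = stack.pop()
        pairs.append((edge, concept))
        stack.extend(sub)
    return root, pairs

root, pairs = split_graph(amr)
random.Random(0).shuffle(pairs)
print(root)        # carry
print(len(pairs))  # 7 edge-node pairs
```
      </preformat>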
    </sec>
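    <p>A plain-Python sketch of the extended contrastive objective from Section 3.1: the row-wise (image-to-text) loss ranges over all captions, including hard negatives, while the column-wise (text-to-image) loss is computed only for positive caption columns. The tiny similarity matrix is illustrative only.</p>
    <preformat>
```python
import math

def cross_entropy(logits, target):
    # Standard cross-entropy over a list of logits for one sample.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def extended_clip_loss(sim):
    # sim[i][j]: scaled cosine similarity between image i (N images)
    # and caption j (2N captions: N positives then N hard negatives).
    # Image-to-text loss ranges over all 2N captions; text-to-image
    # loss is computed only for the N positive caption columns, since
    # negative captions have no corresponding image.
    n = len(sim)
    i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2i = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (i2t + t2i) / 2

# Two images, their two positive captions, and two hard negatives.
sim = [
    [5.0, 0.0, 1.0, 0.0],
    [0.0, 5.0, 0.0, 1.0],
]
print(round(extended_clip_loss(sim), 4))
```
    </preformat>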
    <sec id="sec-7">
      <title>Algorithm 1: Negative AMR Generation</title>
      <p>Require: G ▷ Input AMR graph
Ensure: Negative_G ▷ Output negative AMR graph
root_node, list_of_edge_node_pairs = split_graph(G) ▷ Split the graph
list_of_edge_node_pairs = random.shuffle(list_of_edge_node_pairs)
Negative_G ← [(root, root_node)]
Node_stack ← [root_node]
depth ← 1
for edge, node in list_of_edge_node_pairs do
    choice = random.choice([*range(1, depth + 1, 1)])
    if choice = 1 then ▷ Move to the next level
        Negative_G.append(Node_stack[-1], edge, node)
        Node_stack.append(node)
        depth += 1
    else ▷ choice = 2: stay at the current level; choice = n: go back to a previous level
        move_forward_depth = choice - 2
        depth -= move_forward_depth
        while move_forward_depth &gt; -1 do
            Node_stack.pop(-1)
            move_forward_depth -= 1
        end while
        Negative_G.append(Node_stack[-1], edge, node)
        Node_stack.append(node)
    end if
end for</p>
      <p>Next, we proceed to reconstruct a semantic tree by randomly concatenating nodes from the split parts. We shuffle the list of edge-node pairs and sequentially select edge-node pairs one by one. The process begins at layer 1 with the root node. At this stage, the first node has only one option, which is to connect to the root node and move to layer 2. Subsequently, at layer 2, the next nodes have two options: either to remain at layer 2 by connecting to the root node, or to move to a deeper layer by connecting to the previous node at layer 2. If a node moves to a deeper layer, for instance layer 3, the subsequent node has three options: to remain at the current layer, to move deeper, or to move back to a previous layer. This iterative process continues until all nodes are connected within the semantic tree. In Figure 3, when considering the pair (:mod, large), there are indeed three options available: the node "large" can remain at the current layer by connecting to the node "truck", proceed to a deeper layer by connecting to the node "few", or revert back to connect with the root node. The shuffled AMR entails reordering all nodes along with their edges except the root node, resulting in a new representation of meaning. Negative captions are then generated based on this shuffled AMR. The algorithm to reconstruct the AMR graph is illustrated in Algorithm 1.</p>
      <p>The distinction between negative AMR generation and AMR-DA lies in their respective objectives. AMR-DA aims to regulate modifications to avoid distorting the overall semantic meaning of the sentence by selectively adding or removing nuanced semantic components. Negative AMR generation, on the other hand, focuses on retaining the majority of the semantic components while generating entirely different semantic representations.</p>
    </sec>
    <sec id="sec-exp">
      <title>4. Experiments</title>
      <p>We conduct experiments on different evaluation datasets to explore the impact of AMR generated negatives on the performance of vision language models in compositional understanding tasks.</p>
      <sec id="sec-exp-1">
        <title>4.1. Experimental Settings</title>
        <p>We explore whether AMR generated negatives improve the compositional understanding performance of the model, so we follow the training setup in NegCLIP [5], which finetunes CLIP based on ViT-B/32 (https://github.com/openai/CLIP) on the COCO dataset with token swap hard negatives.</p>
        <p>For negative captions, we assign a specific probability of replacing the original token swap caption with an AMR generated negative augmentation. In the main results, the probability of replacing negatives in NegCLIP is set at 30%. In other words, about 30% of the captions are paired with our AMR generated hard negative captions, while the remainder, with the original token swap negative samples, are utilized for contrastive training. This approach ensures a diverse range of negatives is maintained. A comparison of different probabilities is included in Section 5.3. For each image, one of the three nearest negative neighbors, determined by CLIP encoding, is sampled as the hard image.</p>
        <p>NegCLIP initially sets the batch size to 1024. However, due to device limitations, we are constrained to train the model on a single NVIDIA RTX 2080 Ti GPU, reducing our batch size to 32. Consequently, we adjust the warm-up steps to 1600. Contrastive learning relies on batch size, as it involves contrasting samples within each batch; therefore, larger batch sizes are anticipated to yield greater improvements. We employ the AdamW optimizer with a cosine annealing schedule and train for 5 epochs. The learning rate is explored within the range {1e-5, 5e-6, 1e-6}, with reported results utilizing a learning rate of 5e-6.</p>
      </sec>
      <sec id="sec-exp-2">
        <title>4.2. Evaluation Dataset</title>
        <p>We assess the efficacy of our approach on two widely used benchmarks for compositional understanding: ARO [5] and SugarCrepe [6]. ARO stands for Attribution, Relation, and Order, and includes four tasks. The Visual Genome Relation (VG-Relation) and Visual Genome Attribution (VG-Attribution) tasks entail selecting the correct caption from two options, where negative captions alter either the object of the relation or the object's attribution. The Flickr30k Order and COCO Order tasks demand that models accurately identify the correctly ordered caption from five options, where negative captions modify the order of tokens within the caption. SugarCrepe aims to address the issue of negative captions being implausible and non-fluent by employing large language models to generate fluent and challenging negative captions. The dataset encompasses three tasks, Replace, Swap, and Add, which entail various actions aimed at evaluating models' compositional understanding.</p>
      </sec>
      <sec id="sec-exp-3">
        <title>4.3. Main Results</title>
        <p>We incorporate AMR generated negative samples into our contrastive training data and refer to our method as AMR-NegCLIP. In this study, we undertake a comparative analysis of the outcomes of our AMR-NegCLIP approach against the results produced by several baseline models: ViT-B-32, standard CLIP finetuned on the COCO dataset (CLIP), and CLIP finetuned with token-level hard negatives (NegCLIP).</p>
        <p>From Table 1, we find that our AMR-NegCLIP achieves superior performance across all subtasks. On the Visual Genome datasets, AMR-NegCLIP obtains a 2.2% improvement in the Relation task over NegCLIP and a 4.6% improvement in the Attribution task. On the Flickr30k Order dataset, there is a 2.9% improvement compared to NegCLIP and a substantial 34.4% improvement over CLIP. On the COCO Order dataset, there is a 5.6% improvement over NegCLIP and an impressive 45.6% improvement over CLIP. In the Replace and Add tasks within SugarCrepe, AMR-NegCLIP exhibits limited improvements when contrasted with NegCLIP, with a 1.0% improvement in the Replace task and a 0.2% improvement in the Add task. This discrepancy can be attributed to the nature of the Replace and Add tasks, which involve modifying concepts within the caption; AMR-NegCLIP generates negatives that maintain the same concepts as the positive caption, thereby not entirely aligning with the task requirements. In contrast, another notable observation is a significant improvement, 5.9% over NegCLIP, in the Swap task of SugarCrepe, a challenge that proves to be particularly daunting for pretrained CLIP models, as highlighted in the SugarCrepe paper [6]. In their study, the SugarCrepe authors evaluate over ten vision language models and note that "all models struggle at identifying SWAP hard negatives, regardless of their pretraining dataset and model size." This difficulty arises from the nature of the swap action in SugarCrepe, which involves neither adding nor excluding any concepts but rather swapping objects or attributes while maintaining fluency and grammatical correctness, a task demanding a deeper understanding of composition from vision language models. This closely aligns with our motivation to employ meaning representations in negative generation.</p>
      </sec>
    </sec>
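    <p>Algorithm 1 can be turned into runnable Python as follows; the (parent, edge, node) triple encoding of the output graph and the example pair list are illustrative choices, not the paper's exact implementation.</p>
    <preformat>
```python
import random

def reconstruct(root_node, edge_node_pairs, seed=0):
    # Runnable sketch of Algorithm 1 (Negative AMR Generation).
    # Shuffled (edge, node) pairs are re-attached one at a time: each
    # pair either descends one level below the last attached node,
    # stays at the current level, or climbs back toward the root,
    # yielding a tree with the original parts but a new shape.
    rng = random.Random(seed)
    pairs = list(edge_node_pairs)
    rng.shuffle(pairs)
    negative_g = [("root", None, root_node)]
    node_stack = [root_node]
    depth = 1
    for edge, node in pairs:
        choice = rng.choice(range(1, depth + 1))
        if choice == 1:        # move to the next level
            negative_g.append((node_stack[-1], edge, node))
            node_stack.append(node)
            depth += 1
        else:                  # choice = 2: stay; choice = n: go back
            move = choice - 2
            depth -= move
            while move > -1:
                node_stack.pop()
                move -= 1
            negative_g.append((node_stack[-1], edge, node))
            node_stack.append(node)
    return negative_g

pairs = [(":ARG0", "truck"), (":ARG1", "and"), (":op1", "item"),
         (":quant", "amount"), (":mod", "large"), (":op2", "person"),
         (":quant", "few")]
graph = reconstruct("carry", pairs)
print(len(graph))  # the root entry plus one entry per edge-node pair
```
    </preformat>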
    <sec id="sec-8">
      <title>Understanding the meaning of images has long been a</title>
      <p>goal. Scene graphs have emerged as a popular method
for encoding objects, their attributes, and relationships
within graphs. Abdelsalam et al.’s work [11] discusses
negative generation. Example evaluation data for ARO the diference between AMR and Scene Graphs through
and SugarCrepe are provided in Table 2. detailed statistical analysis on entity and relation
catego</p>
      <p>In Order evaluation dataset, negative samples exhibit rization. Their conclusion highlights that AMR encodes
greater diversity. The introduction of Swap in Sugar- a broader range of relationships, particularly abstract
Crepe aims to rectify instances of textual non-fluency semantic relationships absent in scene graphs.
and implausibility, thereby rendering it more resilient Some studies have also explored leveraging scene
against potential hacking attempts from blind models. graphs to construct negative samples, particularly
fo</p>
      <p>In conclusion, the results indicate that integrating cusing on token swapping, such as swapping
asymmetAMR generated negative captions significantly improves ric relations [15, 10, 5]. These methods have produced
VLM’s performance on various composition tasks, espe- limited variants. However, our approach addresses the
cially dealing with high-level compositional understand- entire semantic representation rather than specific
toing captions. ken swaps. To analyze the diference between outputs,
we present the generated negative samples from
Random Token Swap, Scene Graph Node Swap, and AMR
5. Analysis Reconstruction in Table 3.</p>
      <p>In contrast to Random Token Swap approach,
leverag5.1. Comparison with Scene Graph ing scene graphs yields a richer array of syntactic and
semantic cues. However, the generated negatives
adhere to rule-based criteria, such as swapping exclusively
between adjective words or words sharing a common
relational structure. It is evident that AMR Reconstruction</p>
      <p>CLIP
NegCLIP
introduces a wider spectrum of variations to the original
captions, all while upholding the core semantic
components. Our methodology thus ofers enhanced flexibility
in generating negative training data.</p>
      <p>Furthermore, we compare AMR-NegCLIP with other
negative augmentation-based methods, Semantic
Negative [10], which constructs negative samples using scene cal scores and struggles to diferentiate between captions,
graph node swaps, and CLIP-SGVL [15], which utilizes whereas AMR-NegCLIP excels in selecting the correct
scene graphs in multiple ways, including positive and option. Examples of Replace Relationship and Replace
negative caption generation, as well as scene graph pre- Attribution tasks highlight instances where CLIP
strugdiction tasks, in Table 4. However, the training and vali- gles to discern subtle yet crucial concept replacements.
dation data sets of Semantic Negative are diferent from These nuances have been efectively addressed through
ours, but it can also be seen that it is challenging to im- negative caption contrastive learning.
prove the accuracy of both relationships and attributes by
changing the negative samples. The findings indicate that 5.3. Performance Impact Analysis of AMR
AMR-NegCLIP achieves superior average performance
in comparison to the Semantic Negative method. This Generated Negative Sample Ratios
observation underscores the eficacy of employing AMR AMR generated negative samples tend to distort entire
segenerated negatives, which manifest more pronounced mantic representations of given captions, while NegCLIP
enhancements when compared to the strategy of swap- swaps the positions of tokens. Their generated negative
ping scene graph nodes. Negative sample generation samples address varying levels, from individual objects
rules in CLIP-SGVL are similar to those of Semantic Neg- to complete semantics. To ensure augmented data spans
ative. Our AMR-NegCLIP demonstrated superior perfor- diferent levels in the training dataset, we retain parts
mance in Order tasks with more variants. of negative samples from NegCLIP while replacing a
ratio of NegCLIP samples with AMR generated negative</p>
      <sec id="sec-8-1">
        <title>5.2. Case Study samples.</title>
        <p>To assess the impact of AMR generated negative
samples on model performance, we replace NegCLIP
negatives at ratios ranging from 10% to 100%, and present the
results in Table 5. When replacing only 10% of NegCLIP
negatives with AMR generated negative samples, the
model performance exhibits noticeable improvements,
particularly 6.1% in COCO Order subtasks. The best
performance is achieved when 30% of the token swap
negatives are replaced by AMR-generated negatives. Across
replacement ratios ranging from 10% to 60%, the
integration of AMR generated negatives yields improvements
for NegCLIP across all subtasks. These enhancements are
consistently observed, with average performance gains
ranging from 1.6% to 3.8%. Beyond a 70% replacement
ratio, larger ratios result in decreased model performance.
Specifically, when 90% and 100% of the negative samples are AMR generated, performance is inferior to that of token swap negatives but still superior to CLIP. This may be attributed to the greater diversity of AMR generated negatives: unlike token swap negatives, which follow a unified pattern, AMR generated negatives lack such consistency, making it challenging for models to learn from them effectively when the replacement ratio is high. We therefore propose that our AMR generated negative captions can effectively complement token swap generations.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Case Studies</title>
      <p>We present several case studies illustrating the results of CLIP and AMR-NegCLIP across four subtasks in SugarCrepe, as depicted in Figure 4. SugarCrepe uses large language models to generate captions with a high degree of fluency and commonsense plausibility, making it challenging for VLMs to discern the negative captions. For instance, in the Swap Object task, VLMs must comprehend the semantics of relationships such as "in" and "background", as well as discern the object and subject of these relationships. Our test results demonstrate that while CLIP assigns closely aligned similarity scores to captions and negative captions, AMR-NegCLIP shows superior discriminatory capability. Furthermore, in the Swap Attribution task, models must accurately identify quantities and the positions of the corresponding objects to succeed; here CLIP again returns nearly identical scores for the caption and its negative.</p>
      <p>[Figure 4: Caption and negative-caption pairs with similarity scores from CLIP, NegCLIP, and AMR-NegCLIP on the Swap Object, Swap Attribution, and Replace Relationship subtasks, e.g. "Three Jack-O-Lanterns of various shapes, one of which has flowers in it" vs. "Flowers of various shapes, one of which has three Jack-O-Lanterns in it".]</p>
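<p>The case studies above boil down to one decision rule: an example counts as solved when the image embedding is more similar to the true caption than to the hard negative. A minimal sketch of that rule, using toy list-based embeddings in place of real CLIP features (the helper names are illustrative, not from the paper's code):</p>

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def prefers_caption(image_emb, caption_emb, negative_emb):
    """A SugarCrepe-style example counts as correct when the image
    embedding is closer to the true caption than to the hard negative."""
    return cosine(image_emb, caption_emb) > cosine(image_emb, negative_emb)

# Toy embeddings standing in for encoder outputs (not real CLIP features).
rng = random.Random(0)
image = [rng.gauss(0, 1) for _ in range(512)]
caption = [x + 0.1 * rng.gauss(0, 1) for x in image]   # near the image
negative = [rng.gauss(0, 1) for _ in range(512)]       # unrelated direction

print(prefers_caption(image, caption, negative))
```

<p>In the actual evaluation the three embeddings come from a VLM's image and text encoders; the difficulty SugarCrepe exposes is that for fluent, meaning-flipped negatives the two cosine scores are nearly tied for CLIP.</p>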
    </sec>
    <sec id="sec-10">
      <title>Limitations</title>
      <p>Conducting AMR parsing and generation typically requires GPU acceleration, which incurs higher costs than direct token shuffling. However, compared to tasks such as scene graph parsing or querying large language models, it remains an efficient approach. It is also worth noting that splitting and shuffling AMR components introduces significant randomness into negative generation, which may occasionally yield suboptimal results.</p>
    </sec>
    <sec id="sec-11">
      <title>Conclusion</title>
      <p>To overcome the limitations of vision language models in comprehending composition and semantics, we suggest constructing hard negative samples by splitting and reconstructing AMR graphs. Compared to token and scene graph negative generation, AMR generated negatives exhibit greater diversity while maintaining fluency as far as possible. Our experimental results illustrate that incorporating our generated negatives in contrastive learning significantly boosts model performance, particularly in tasks that demand high-level comprehension. Furthermore, beyond simple shuffling, AMR offers the potential for more controlled modifications based on human instructions: for instance, users could add semantic components that are absent in the picture to deliberately confuse VLMs. We view this as a promising avenue for future research.</p>
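<p>The kind of meaning-level edit described here can be illustrated with a toy role swap on AMR triples. This is a simplified sketch, not the paper's pipeline: the actual method parses captions into AMR with a neural parser, splits and reshuffles graph components, and regenerates text with an AMR-to-text model. The <code>swap_core_roles</code> helper and the hand-written triples below are hypothetical illustrations.</p>

```python
def swap_core_roles(triples, predicate):
    """Return a copy of AMR triples with the :ARG0 and :ARG1 roles of
    `predicate` exchanged, flipping who-does-what-to-whom."""
    flipped = {":ARG0": ":ARG1", ":ARG1": ":ARG0"}
    return [
        (src, flipped.get(role, role), tgt) if src == predicate else (src, role, tgt)
        for (src, role, tgt) in triples
    ]

# Hand-written, simplified triples for "the dog chases the cat".
caption_amr = [
    ("c", ":instance", "chase-01"),
    ("c", ":ARG0", "d"),           # chaser
    ("d", ":instance", "dog"),
    ("c", ":ARG1", "t"),           # thing chased
    ("t", ":instance", "cat"),
]

# Same concepts, opposite meaning: "the cat chases the dog".
negative_amr = swap_core_roles(caption_amr, "c")
print(negative_amr)
```

<p>Because the edit operates on the graph rather than the surface string, the regenerated negative caption reuses the original vocabulary yet expresses a genuinely different meaning, which is what distinguishes these negatives from token-swap patterns.</p>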
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>