<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing Semantic Understanding in Vision Language Models Using Meaning Representation Negative Generation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ziyi</forename><surname>Shou</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Fangzhen</forename><surname>Lin</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
<orgName type="department">HKUST-Xiaoi Joint Laboratory, Department of Computer Science and Engineering</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Hong Kong University of Science and Technology</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing Semantic Understanding in Vision Language Models Using Meaning Representation Negative Generation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">506DF5DDAAEC639EEE41F43B342A40B0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Vision Language Models</term>
					<term>Semantic Understanding</term>
					<term>Compositional Understanding</term>
					<term>Abstract Meaning Representation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Vision language models have been criticized for performing like bag-of-words models, lacking semantic understanding. Efforts to address this concern have included integrating composition-aware negative samples into contrastive learning methodologies. However, current negative generation methods show restricted semantic comprehension, diversity, and fluency. To tackle this issue, we propose leveraging Abstract Meaning Representation (AMR), a representation of considerable interest in natural language processing research, for negative sample generation. By altering the structure of the meaning representation, we create negative samples that have entirely different meanings but remain close paraphrases of the originals in plain text. These AMR generated negatives are then incorporated alongside token swap negatives during contrastive training.</p><p>Our results indicate that AMR generated negatives introduce significantly more diverse patterns. Furthermore, the inclusion of AMR generated negative samples enhances the models' performance across a range of compositional understanding tasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, vision language models (VLMs) have developed conspicuously across various tasks <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. However, VLMs have been criticized for performing akin to bag-of-words models, lacking semantic understanding, especially compositional understanding <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b4">5]</ref>. For instance, when some tokens in the caption of an image-caption pair are rearranged to yield an unaligned caption, a VLM may fail to notice the change. Consider the two image-caption pairs in Figure <ref type="figure" target="#fig_0">1</ref>. In the left pair, the phrases "Three Jack-O-Lanterns" and "flowers" in the caption are swapped, resulting in a semantically very different sentence; yet CLIP fails to notice the difference and even assigns the modified caption a slightly higher similarity score. A similar effect can be seen in the right pair when the phrases "Clock tower" and "a bronze statue" are swapped. These are not isolated examples. As Yuksekgonul et al. <ref type="bibr" target="#b4">[5]</ref> pointed out, VLMs "behave like bags-of-words" because they have mostly been pretrained on large-scale web datasets for retrieval tasks, where image and caption matching can often be done using keywords alone.</p><p>A straightforward and effective solution involves mining hard negative samples for contrastive learning. This entails including, in the same batch, negative instances with similar semantic components but distinct relationships, challenging the model to discern the correct caption amidst such variations. For example, NegCLIP <ref type="bibr" target="#b4">[5]</ref> constructs negative image captions by swapping tokens. However, token swap methods lack semantic understanding, resulting in predictable patterns and a lack of plausibility and fluency. Blind models, trained solely on text without considering images, may exploit these patterns to manipulate evaluations to their advantage <ref type="bibr" target="#b5">[6]</ref>.</p><p>Meaning representations offer an alternative approach to constructing negative samples with greater diversity and fluency. Abstract Meaning Representation (AMR, <ref type="bibr" target="#b6">[7]</ref>), which encodes concepts as nodes and depicts the relationships between concepts as graph edges, stands out as a prevalent semantic representation in text tasks, valued for its high expressiveness and human-friendly comprehensibility. We propose to utilize AMR to create negative samples that possess entirely distinct meanings but remain close paraphrases of the originals at the surface level. To achieve this, we modify the structure of the meaning representation by randomly shuffling the positions of subtrees within AMR graphs and reconstructing new meaning representations. Negative captions are then generated from the new meaning representations using an AMR generator. We blend our generated negatives with token swap negatives to broaden the diversity of negative samples and enhance generalization. Vision language models then undergo training to distinguish between true labels and negative samples.</p><p>Our findings indicate that incorporating negative samples generated from meaning representations improves model performance across diverse compositional understanding benchmarks. Additionally, our generated negatives introduce varied patterns, enriching the diversity of augmentations compared to token swap negatives.</p><note place="foot">KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain. zshou@cse.ust.hk (Z. Shou); flin@cse.ust.hk (F. Lin)</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">AMR Data Augmentation</head><p>AMR encodes concepts as nodes and illustrates the relationships between these concepts as edges. It has been shown to be advantageous in various natural language processing tasks, such as data augmentation. Token edit data augmentations in NLP often generate ill-formed or incoherent sentences, as they do not consider sentence structure. AMR Data Augmentation (AMR-DA) <ref type="bibr" target="#b7">[8]</ref> proposes utilizing AMR for data augmentation: positive samples are constructed by meticulously controlling minor nuances within a carefully designed meaning representation framework, producing several fluent and distinct positive augmentations for a given sentence. Inspired by AMR-DA, we explore the utilization of AMR in compositional understanding tasks for vision language models. However, our approach diverges significantly; rather than focusing on careful modifications to the meaning representation for positive sample generation, we employ AMR for negative sample generation. Our methodology involves splitting the meaning representation and shuffling its components to construct a new negative representation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Composition-aware Hard Negatives</head><p>For generating negative captions for contrastive learning, a straightforward approach involves modifying linguistic elements. To improve compositional understanding, <ref type="bibr" target="#b4">[5]</ref> leverage spaCy for syntactic analysis to identify and swap the positions of two elements within the caption. Such token swap modifications, aimed at creating variations in composition, are relatively straightforward to implement but often struggle to maintain grammaticality. Moreover, they can be vulnerable to exploitation, as the patterns of modification may become predictable even without considering information from the image encoder. <ref type="bibr" target="#b8">[9]</ref> initially parse the syntactic structure of the caption, then randomly mask text and utilize a large language model to unmask it and generate a new negative caption. While the resulting caption tends to exhibit improved grammatical correctness, the modification process lacks fine control, and the generated variants remain somewhat constrained in scope. To address the limitations of semantic modification, <ref type="bibr" target="#b9">[10]</ref> propose leveraging scene graphs to generate semantic negative captions. They implement a strategy of interchanging the positions of the subject and object within the same relation, as well as swapping the attributes of different objects. However, the range of modifications possible on scene graphs is limited. Compared to scene graphs, meaning representations encode a more extensive range of relations, especially higher-level abstract semantic relations absent in scene graphs <ref type="bibr" target="#b10">[11]</ref>. This suggests that meaning representations have a higher potential to improve downstream tasks that require an understanding of higher-level semantic information in images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Extensive Contrastive Learning</head><p>The aim of contrastive learning is to bring similar representations into closer proximity while pushing apart dissimilar samples. This principle mirrors its application within vision language model training, exemplified by Contrastive Language-Image Pre-Training (CLIP, <ref type="bibr" target="#b0">[1]</ref>), which has emerged as a prominent paradigm in vision language learning. The training objective of CLIP is to align text-image pairs effectively. CLIP simultaneously trains an image encoder and a text encoder to extract feature representations from each modality, denoted as I_n for image features and T_n for text features. These features are then utilized to compute scaled pairwise cosine similarities, serving as logits. Finally, a symmetric cross-entropy loss is computed over these similarity scores to guide the training process.</p><p>[Figure 2: the training pipeline. Original images and hard images pass through the image encoder; captions for both (e.g., "A small child wearing headphones plays on the computer."), together with negative captions for both, pass through the text encoder, yielding a similarity matrix with entries such as I_1·T_1, I_1·NT_1, I_1·T_1^-, NI_1·T_1, and NI_1·NT_1^-.]</p><p>In response to the challenge of vision language models struggling to comprehend text composition, we adopt the approach proposed by Yuksekgonul et al. <ref type="bibr" target="#b4">[5]</ref>, which introduced two extensions to standard contrastive learning aimed at increasing the complexity of model learning: (1) introducing challenging images for the image encoder, selected as nearest neighbors of the original images under CLIP encoding, and (2) incorporating hard negative captions for the text encoder to distinguish. The difference is that we add AMR generated negative samples to the hard negative captions, with modifications aimed at preserving most plain-text tokens while completely distorting the semantic meaning. Figure <ref type="figure" target="#fig_1">2</ref> illustrates the training pipeline. In each batch, original images I_n and their nearest neighbors NI_n are included. Corresponding captions T_n and NT_n are concatenated with hard negative captions T_n^- and NT_n^-, doubling the number of captions relative to the number of images. A symmetric cross-entropy loss is then computed as in CLIP; however, only the column-wise loss for positive captions is incorporated, as negative captions lack corresponding images for comparison.</p></div>
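The extended loss described in Section 3.1 can be sketched in PyTorch as follows. This is a minimal illustration under the assumption of precomputed, batch-aligned embeddings, not the authors' implementation; the function name and tensor shapes are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negatives(img_emb, txt_emb, neg_txt_emb, temperature=0.07):
    """CLIP-style symmetric loss extended with hard negative captions.

    img_emb:     (B, D) image features (originals plus nearest-neighbor hard images)
    txt_emb:     (B, D) matching caption features
    neg_txt_emb: (B, D) hard negative caption features (token swap or AMR generated)

    Only the image-to-text direction sees the negative columns, since
    negative captions have no corresponding image.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    all_txt = F.normalize(torch.cat([txt_emb, neg_txt_emb], dim=0), dim=-1)

    b = img_emb.size(0)
    logits_i2t = img_emb @ all_txt.t() / temperature   # (B, 2B): positives + negatives
    logits_t2i = logits_i2t[:, :b].t()                 # (B, B): positive captions only

    labels = torch.arange(b)
    loss_i2t = F.cross_entropy(logits_i2t, labels)
    loss_t2i = F.cross_entropy(logits_t2i, labels)
    return (loss_i2t + loss_t2i) / 2
```

Dropping the negative columns from the text-to-image direction corresponds to the paper's remark that only the column-wise loss for positive captions is kept.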
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">AMR for Negative Sample Generation</head><p>In contrast to token swap negative generation, we propose to generate negative samples using AMR. AMR encodes semantics into graphs and has demonstrated effectiveness as an intermediate representation in natural language augmentation tasks. We adopt a pipeline similar to AMR-DA <ref type="bibr" target="#b7">[8]</ref>: parsing sentences into AMR, modifying the AMR, and generating samples from the modified AMR. However, our objective differs significantly from that of AMR-DA. While they meticulously modify the intermediate AMR to construct positive samples, our task requires generating entirely different semantic representations, albeit with the same semantic components as the given samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Meaning Representation</head><p>Abstract Meaning Representation (AMR, <ref type="bibr" target="#b6">[7]</ref>) is a rooted, directed graph that encodes sentence concepts as nodes and the relations between these concepts as directed edges. In Figure <ref type="figure" target="#fig_2">3</ref>, the leftmost portion depicts the AMR graph corresponding to the caption "A truck carries a large amount of items and a few people." In this graph, the root "carry" serves as the primary predicate of the sentence, with "truck" designated as the first argument (denoted as ARG0) of "carry", while the subtree rooted at "and" represents the second argument. AMR facilitates readability for both human and machine comprehension and can be adapted to various purposes as needed. In this study, our proposal involves splitting the AMR graph, shuffling its components, and then reconstructing a new AMR graph. This process aims to create a hard negative graph in which all semantic parts are retained but the overall meaning is distorted.</p></div>
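For concreteness, the AMR for this caption could be written in PENMAN notation roughly as follows; this is a hand-written illustration of the structure described above, not the exact parser output.

```
(c / carry-01
   :ARG0 (t / truck)
   :ARG1 (a / and
      :op1 (i / item
         :quant (a2 / amount
            :mod (l / large)))
      :op2 (p / person
         :quant (f / few))))
```

The split step of Section 3.2.3 would detach the root `carry-01` and collect edge-node pairs such as `(:ARG0, truck)` and `(:mod, large)`.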
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Generation Pipeline</head><p>The entire pipeline is illustrated in Figure <ref type="figure" target="#fig_2">3</ref>. We adopt the AMR-DA pipeline, which involves first parsing the caption into an AMR graph using an AMR parser, then modifying this AMR graph, and finally utilizing an AMR generator to produce negative captions based on the modified AMR. We use the SPRING parser <ref type="bibr" target="#b11">[12]</ref> as our AMR parser. SPRING employs a depth-first search to linearize AMRs and uses a special token &lt;𝑅𝑛&gt; to manage co-referring nodes; the parser is trained on the BART model <ref type="bibr" target="#b12">[13]</ref>. After obtaining the AMR graph for the caption, we apply a split and reconstruct algorithm to construct a new AMR graph, described in detail in the subsequent paragraphs. Finally, we employ the T5-based PLMs-Generator <ref type="bibr" target="#b13">[14]</ref> as our AMR generator to convert AMR to text. The model-based generator is tolerant of certain unreasonable aspects of our modified graphs: it can rectify them to some extent and produce new samples closely resembling the given graph. This flexibility provides greater latitude for modifying the AMR graph than rule-based methods allow. For instance, in Figure <ref type="figure" target="#fig_2">3</ref>, although the modified graph contains illogical elements such as the node "and" lacking children, the generator is still capable of generating fluent and grammatically correct text.</p></div>
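The three-stage pipeline (parse, modify, generate) can be sketched with stand-in components. `StubParser` and `StubGenerator` below are placeholders for the SPRING parser and the T5-based PLMs-Generator, whose actual interfaces we do not reproduce here; only the control flow reflects the paper.

```python
class StubParser:
    """Stand-in for a text-to-AMR model such as SPRING.

    Returns a toy graph as (root, [(relation, node), ...]); a real parser
    would return a full AMR graph for the input caption.
    """
    def parse(self, caption):
        return ("carry", [(":ARG0", "truck"), (":ARG1", "item")])

class StubGenerator:
    """Stand-in for an AMR-to-text model such as the T5-based generator."""
    def generate(self, graph):
        root, edges = graph
        return f"{root}: " + " ".join(node for _, node in edges)

def make_negative_caption(caption, parser, modifier, generator):
    """Parse -> modify -> generate: the three-stage pipeline of Section 3.2.2."""
    graph = parser.parse(caption)
    negative_graph = modifier(graph)
    return generator.generate(negative_graph)
```

In the paper the `modifier` slot is filled by the split and reconstruct algorithm of Section 3.2.3.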
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3.">AMR Split and Reconstruct</head><p>The key component of generating negative samples through AMR lies in our split and reconstruct algorithm.</p><p>Unlike existing methods that rely on token swapping within the sentence or rule-based node swapping in the scene graph, our approach offers greater flexibility by directly modifying the entire meaning representation. Modifications to AMR afford a broader range of possibilities owing to the diverse types of edges and nodes present.</p><p>In our algorithm, we split the AMR graph, regarding the root node as a separate entity while treating every other node together with its incoming edge as an edge-node pair. As illustrated in Figure <ref type="figure" target="#fig_2">3</ref>, the left-hand side depicts the AMR graph corresponding to the original caption "A truck carries a large amount of items and a few people." Following the split process, we obtain a root node and a collection of edge-node pairs such as "carry, [(:ARG0, truck), (:ARG1, and), ...]". Next, we reconstruct a semantic tree by randomly attaching nodes from the split parts. We shuffle the list of edge-node pairs and select them sequentially, one by one. The process begins at layer 1 with the root node. At this stage, the first node has only one option: to connect to the root node and move to layer 2. Subsequently, at layer 2, each following node has two options: either to remain at layer 2 by connecting to the root node, or to move to a deeper layer by connecting to the previous node at layer 2. If a node moves to a deeper layer, for instance layer 3, the subsequent node has three options: to remain at the current layer, to move deeper, or to move back to the previous layer. This iterative process continues until all nodes are connected within the semantic tree. 
In Figure <ref type="figure" target="#fig_2">3</ref>, when considering the pair (:mod, large), there are indeed three options available. The node "large" can either remain at the current layer by connecting to the node "truck", proceed to a deeper layer by connecting to the node "few", or revert to connect with the root node. The shuffled AMR entails reordering all nodes along with their edges, except the root node, resulting in a new representation of meaning. Negative captions are then generated based on this shuffled AMR. The algorithm to reconstruct the AMR graph is given in Algorithm 1.</p><p>The distinction between negative AMR generation and AMR-DA lies in their respective objectives. AMR-DA aims to regulate modifications to avoid distorting the overall semantic meaning of the sentence by selectively adding or removing nuanced semantic components. On the other hand, negative AMR generation focuses on retaining the majority of the semantic components while generating an entirely different semantic representation.</p></div>
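The split and reconstruct procedure can be sketched in Python as follows. This is our reading of Algorithm 1, with one simplification we flag explicitly: a node may reattach to any ancestor on the current path, which subsumes the stay/deepen/backtrack choices described above. Node labels are assumed unique in this sketch.

```python
import random

def split_amr(root, edges):
    """Split a graph into its root and a list of (relation, node) pairs."""
    return root, list(edges)

def reconstruct_amr(root, pairs, rng=random):
    """Randomly reattach shuffled (edge, node) pairs under the root.

    A stack of ancestors tracks the current path. Each new node picks a
    parent anywhere on that path: picking the deepest entry deepens the
    tree, picking a shallower entry backtracks, and picking the same
    parent as before stays at the current layer.
    """
    pairs = list(pairs)
    rng.shuffle(pairs)
    tree = {root: []}                        # adjacency: parent -> [(rel, child)]
    path = [root]                            # current chain of ancestors
    for rel, node in pairs:
        depth = rng.randrange(len(path))     # choose a parent on the path
        parent = path[depth]
        tree.setdefault(node, [])
        tree[parent].append((rel, node))
        path = path[:depth + 1] + [node]     # node becomes the deepest ancestor
    return tree
```

The first iteration starts with `path = [root]`, so the first node can only attach to the root, matching the description of layer 1 above.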
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>We conduct experiments on different evaluation datasets to explore the impact of AMR generated negatives on the performance of vision language models in compositional understanding tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experimental Settings</head><p>To explore whether AMR generated negatives improve compositional understanding, we follow the training setup of NegCLIP <ref type="bibr" target="#b4">[5]</ref>, which finetunes CLIP (ViT-B/32<ref type="foot" target="#foot_0">1</ref>) on the COCO dataset with token swap hard negatives.</p><p>For negative captions, we assign a specific probability of replacing the original token swap caption with an AMR generated negative augmentation. In the main results, this probability is set at 30%; in other words, about 30% of the token swap captions are replaced with our AMR generated hard negative captions. NegCLIP originally sets the batch size to 1024. However, due to device limitations, we train the model on a single NVIDIA RTX 2080 Ti GPU with a reduced batch size of 32, and accordingly adjust the warm-up steps to 1600. Contrastive learning is sensitive to batch size, as it contrasts samples within each batch, so larger batch sizes are expected to yield greater improvements. We employ the AdamW optimizer with a cosine annealing schedule for 5 training epochs. The learning rate is explored within {1e-5, 5e-6, 1e-6}, with reported results using 5e-6.</p></div>
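The per-caption replacement scheme amounts to a Bernoulli choice between the two negative sources; a minimal sketch (the function name is ours, and real training would operate on tokenized batches rather than strings):

```python
import random

def mix_negatives(token_swap_negs, amr_negs, p_amr=0.3, rng=random):
    """Replace each token swap negative with its AMR generated counterpart
    with probability p_amr (0.3 in the main results), so each batch mixes
    negatives at both the token level and the full-semantics level."""
    return [amr if rng.random() < p_amr else swap
            for swap, amr in zip(token_swap_negs, amr_negs)]
```

Setting `p_amr` to 0.0 or 1.0 recovers the pure NegCLIP and pure AMR settings analyzed in Section 5.3.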
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Evaluation Dataset</head><p>We assess the efficacy of our approach on two widely used benchmarks for compositional understanding: ARO <ref type="bibr" target="#b4">[5]</ref> and SugarCrepe <ref type="bibr" target="#b5">[6]</ref>. ARO stands for Attribution, Relation, and Order, and includes four tasks. The Visual Genome Relation (VG-Relation) and Visual Genome Attribution (VG-Attribution) tasks entail selecting the correct caption from two options, where the negative caption alters either the object of a relation or an object's attribution. The Flickr30k Order and COCO Order tasks require models to identify the correctly ordered caption among five options, where the negative captions modify the order of tokens within the caption. SugarCrepe aims to address the issue of negative captions being implausible and non-fluent by employing large language models to generate fluent and challenging negative captions. The dataset encompasses three tasks, Replace, Swap, and Add, each targeting a different aspect of models' compositional understanding.</p></div>
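Both benchmarks reduce to picking the highest-similarity caption among a candidate list (two candidates for VG-Relation/VG-Attribution, five for the Order tasks). A sketch of that scoring rule in PyTorch; the function name and shape conventions are ours, not part of either benchmark's official harness:

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(image_embs, caption_embs, correct_idx=0):
    """Fraction of images whose true caption scores highest by cosine similarity.

    image_embs:   (N, D) one embedding per test image
    caption_embs: (N, K, D) K candidate captions per image; the true caption
                  sits at index `correct_idx` in each candidate list
    """
    img = F.normalize(image_embs, dim=-1).unsqueeze(1)   # (N, 1, D)
    caps = F.normalize(caption_embs, dim=-1)             # (N, K, D)
    sims = (img * caps).sum(dim=-1)                      # (N, K) cosine similarities
    return (sims.argmax(dim=-1) == correct_idx).float().mean().item()
```

A blind text-only model can inflate this metric when negatives follow predictable patterns, which is exactly the weakness SugarCrepe was built to expose.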
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Main Results</head><p>We incorporate AMR generated negative samples into our contrastive training data and refer to the resulting method as AMR-NegCLIP. We compare AMR-NegCLIP against several baselines: the pretrained ViT-B/32 model, standard CLIP finetuned on the COCO dataset (CLIP), and CLIP finetuned with token-level hard negatives (NegCLIP).</p><p>From Table <ref type="table" target="#tab_1">1</ref>, we find that AMR-NegCLIP achieves superior performance across all subtasks. On the Visual Genome datasets, AMR-NegCLIP obtains a 2.2% improvement over NegCLIP on the Relation task and a 4.6% improvement on the Attribution task. On Flickr30k Order, there is a 2.9% improvement over NegCLIP and a substantial 34.4% improvement over CLIP. On COCO Order, there is a 5.6% improvement over NegCLIP and an impressive 45.6% improvement over CLIP. On the Replace and Add tasks within SugarCrepe, AMR-NegCLIP exhibits limited improvements over NegCLIP: 1.0% on Replace and 0.2% on Add. This discrepancy can be attributed to the nature of the Replace and Add tasks, which involve modifying concepts within the caption, whereas AMR-NegCLIP generates negatives that retain the same concepts as the positive caption and thus do not entirely align with the task requirements. In contrast, a notable observation is a significant improvement of 5.9% over NegCLIP on the Swap task of SugarCrepe, a challenge that proves particularly daunting for pretrained CLIP models, as highlighted in the SugarCrepe paper <ref type="bibr" target="#b5">[6]</ref>. In their study, the SugarCrepe authors evaluate over ten vision language models and note that "all models struggle at identifying SWAP hard negatives, regardless of their pretraining dataset and model size." 
This difficulty arises from the nature of the swap action in SugarCrepe, which involves neither adding nor excluding any concepts but rather swapping objects or attributes while maintaining fluency and grammatical correctness, a task demanding a deeper understanding of composition from vision language models. This closely aligns with our motivation to employ meaning representations for negative generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Example evaluation data of Visual Genome Relation, Flickr30k Order in ARO; Replace, Swap and Add in SugarCrepe. The italicized text represents a positive caption for the sample, while the other lines contain negative captions. Visual Genome includes two captions per sample, whereas Order test set includes five captions per sample.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Visual Genome Relation</head><p>the door is to the left of the shirt. the shirt is to the left of the door.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Flickr30k Order</head><p>A group of people standing on the lawn in front of a building. Many people in blue jeans stand in front of a white church.</p><p>A large group of people stand outside of a church. Family members standing outside a home. People standing outside of a building.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SugarCrepe Replace</head><p>A tan toilet and sink combination in a small room.</p><p>A white toilet and sink combination in a small room.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SugarCrepe Swap</head><p>Three large horses eating hay while a small horse stands behind.</p><p>A small horse eating hay while three horses stand behind.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SugarCrepe Add</head><p>Two zebras are battling each other on hind legs. Two striped-and-spotted zebras are battling each other on hind legs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Negative Sentences generated using Random Token Swap, Scene Graph Node Swap and AMR Reconstruction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Source</head><p>A truck carries a large amount of items and a few people.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Random Token Swap</head><p>A amount carries a large truck of items and a few people .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Scene Graph Node Swap</head><p>A truck carries a few amount of items and a large people.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AMR Reconstruction</head><p>The items are carried by a few large trucks and an amount of people .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Source</head><p>A pigeon greets three bicyclists on a park path.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Random Token Swap</head><p>A park greets three bicyclists on a pigeon path .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Scene Graph Node Swap</head><p>A bicyclist greets three pigeon on a park path.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AMR Reconstruction</head><p>Greetings , three pigeon bicyclers on the path have been parkled .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Source</head><p>People walking pass a horse drawn carriage sitting at the curb.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Random Token Swap</head><p>People walking pass a horse drawn curb sitting at the carriage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Scene Graph Node Swap</head><p>People sitting at a horse drawn carriage walking pass the curb.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AMR Reconstruction</head><p>People walking by the curb , horse sitting , carriage pulling .</p><p>Example evaluation data for ARO and SugarCrepe are provided in Table <ref type="table">2</ref>.</p><p>In the Order evaluation datasets, negative samples exhibit greater diversity. The introduction of Swap in SugarCrepe aims to rectify instances of textual non-fluency and implausibility, thereby rendering it more resilient against potential hacking attempts by blind models.</p><p>In conclusion, the results indicate that integrating AMR generated negative captions significantly improves VLMs' performance on various composition tasks, especially those involving high-level compositional understanding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Comparison with Scene Graph</head><p>Understanding the meaning of images has long been a research goal. Scene graphs have emerged as a popular method for encoding objects, their attributes, and their relationships within graphs. Abdelsalam et al. <ref type="bibr" target="#b10">[11]</ref> discuss the difference between AMR and scene graphs through detailed statistical analysis of entity and relation categorization. Their conclusion highlights that AMR encodes a broader range of relationships, particularly abstract semantic relationships absent in scene graphs.</p><p>Some studies have also explored leveraging scene graphs to construct negative samples, particularly through token swapping, such as swapping asymmetric relations <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b4">5]</ref>. These methods have produced limited variants. Our approach, in contrast, addresses the entire semantic representation rather than specific token swaps. To analyze the differences between outputs, we present the negative samples generated by Random Token Swap, Scene Graph Node Swap, and AMR Reconstruction in Table <ref type="table">3</ref>.</p><p>In contrast to the Random Token Swap approach, leveraging scene graphs yields a richer array of syntactic and semantic cues. However, the generated negatives adhere to rule-based criteria, such as swapping exclusively between adjectives or between words sharing a common relational structure. It is evident that AMR Reconstruction introduces a wider spectrum of variations to the original captions, all while upholding the core semantic components. Our methodology thus offers enhanced flexibility in generating negative training data. 
Furthermore, we compare AMR-NegCLIP with other negative augmentation-based methods, Semantic Negative <ref type="bibr" target="#b9">[10]</ref>, which constructs negative samples using scene graph node swaps, and CLIP-SGVL <ref type="bibr" target="#b14">[15]</ref>, which utilizes scene graphs in multiple ways, including positive and negative caption generation, as well as scene graph prediction tasks, in Table <ref type="table" target="#tab_2">4</ref>. However, the training and validation data sets of Semantic Negative are different from ours, but it can also be seen that it is challenging to improve the accuracy of both relationships and attributes by changing the negative samples. The findings indicate that AMR-NegCLIP achieves superior average performance in comparison to the Semantic Negative method. This observation underscores the efficacy of employing AMR generated negatives, which manifest more pronounced enhancements when compared to the strategy of swapping scene graph nodes. Negative sample generation rules in CLIP-SGVL are similar to those of Semantic Negative. Our AMR-NegCLIP demonstrated superior performance in Order tasks with more variants.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Case Study</head><p>We present several case studies illustrating the results of CLIP and AMR-NegCLIP across four subtasks in SugarCrepe, as depicted in Figure <ref type="figure" target="#fig_4">4</ref>. SugarCrepe uses large language models to generate captions with a high degree of fluency and commonsense plausibility, making it challenging for VLMs to discern the negative captions. For instance, in the Swap Object task, VLMs must comprehend the semantics of relationships such as "in" and "background", and must distinguish the object and subject of these relationships. Our test results show that while CLIP assigns closely aligned similarity scores to captions and negative captions, AMR-NegCLIP demonstrates superior discriminatory capability. Furthermore, in the Swap Attribution task, models must accurately identify quantities and the positions of the corresponding objects. CLIP returns nearly identical scores and struggles to differentiate between captions, whereas AMR-NegCLIP excels in selecting the correct option. Examples from the Replace Relationship and Replace Attribution tasks highlight instances where CLIP struggles to discern subtle yet crucial concept replacements; these nuances are effectively addressed through negative caption contrastive learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Performance Impact Analysis of AMR Generated Negative Sample Ratios</head><p>AMR generated negative samples tend to alter the entire semantic representation of a caption, whereas NegCLIP swaps the positions of tokens. The two kinds of negatives therefore target different levels, from individual objects to complete semantics. To ensure the augmented training data spans these levels, we retain part of the NegCLIP negatives while replacing a given ratio of them with AMR generated negative samples.</p><p>To assess the impact of AMR generated negatives on model performance, we replace NegCLIP negatives at ratios ranging from 10% to 100% and present the results in Table <ref type="table" target="#tab_3">5</ref>. Replacing only 10% of the NegCLIP negatives already yields noticeable improvements, notably a 6.1% gain on the COCO Order subtask. The best performance is achieved when 30% of the token swap negatives are replaced by AMR generated negatives. Across replacement ratios from 10% to 60%, integrating AMR generated negatives improves NegCLIP on all subtasks, with average performance gains ranging from 1.6% to 3.8%. Beyond a 70% replacement ratio, larger ratios decrease model performance. Specifically, at 90% and 100% replacement, performance is inferior to that of token swap negatives but still superior to CLIP. This may be attributed to the greater diversity of AMR generated negatives: unlike token swap negatives, which follow a unified pattern, AMR generated negatives lack such consistency, making it harder for models to learn from them when the replacement ratio is high. 
Therefore, we propose that our AMR generated negative captions can effectively complement token swap generation.</p></div>
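The ratio study above amounts to a simple per-caption replacement policy. A minimal sketch, assuming index-aligned lists of token-swap and AMR-generated negatives; the `mix_negatives` helper is a hypothetical illustration, not the paper's released code:

```python
import random

def mix_negatives(token_swap_negs, amr_negs, replace_ratio, seed=0):
    """Replace a fixed fraction of token-swap negatives with their
    AMR-generated counterparts, as in the ratio sweep of Table 5."""
    assert len(token_swap_negs) == len(amr_negs)
    n = len(token_swap_negs)
    k = round(n * replace_ratio)                        # captions to replace
    replaced = set(random.Random(seed).sample(range(n), k))
    return [amr_negs[i] if i in replaced else token_swap_negs[i]
            for i in range(n)]

# At the best-performing 30% ratio, 3 of every 10 negatives come from AMR.
mixed = mix_negatives(["swap"] * 10, ["amr"] * 10, replace_ratio=0.3)
```

Fixing the seed makes the sweep reproducible, so each ratio in the comparison differs only in how many negatives are replaced, not in which training pairs are seen.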
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>To overcome the limitations of vision language models in comprehending composition and semantics, we propose constructing hard negative samples by splitting and reconstructing AMR graphs. Compared to token and scene graph negative generation, AMR generated negatives exhibit greater diversity while maintaining fluency as far as possible. Our experimental results show that incorporating the generated negatives in contrastive learning significantly boosts model performance, particularly on tasks that demand high-level comprehension. Furthermore, beyond simple shuffling, AMR offers the potential for more controlled modifications based on human instructions. For instance, users could add semantic components absent from the picture to deliberately confuse VLMs. We view this as a promising avenue for future research.</p><p>Limitations AMR parsing and generation typically require GPU acceleration, which incurs higher costs than direct token shuffling. However, compared to scene graph parsing or querying large language models, it remains an efficient approach. It is also worth noting that splitting and shuffling AMR components introduces significant randomness into negative generation, which may occasionally lead to suboptimal results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example test results of the model's relational understanding. CLIP gives higher similarity scores for unaligned captions.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Extending CLIP to compositional understanding tasks through training with hard neighbor images and AMR generated hard negative captions.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Negative example generated based on AMR. The shuffled AMR entails reordering all nodes along with their edges, except the root node.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Example captions and their negative counterparts with CLIP/AMR-NegCLIP similarity scores; the full examples are shown in Figure 4.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Predictions of CLIP and AMR-NegCLIP on SugarCrepe tasks: Swap Object, Swap Attribution, Replace Relationship and Replace Attribution. The score represents the similarity score between the (Negative) caption and the corresponding image as assessed by CLIP/AMR-NegCLIP. The model selects the caption with the higher similarity score as the correct one.</figDesc><graphic coords="9,90.77,251.34,80.23,66.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>ARO and SugarCrepe results comparison of AMR-NegCLIP with different models.</figDesc><table><row><cell></cell><cell cols="4">ARO</cell><cell cols="3">SugarCrepe</cell></row><row><cell></cell><cell cols="2">Visual Genome</cell><cell>Flickr30k</cell><cell>COCO</cell><cell cols="3">All Datasets Avg</cell></row><row><cell></cell><cell>Relation</cell><cell>Attribution</cell><cell>Order</cell><cell>Order</cell><cell>Replace</cell><cell>Swap</cell><cell>Add</cell></row><row><cell>ViT-B-32</cell><cell>51.1</cell><cell>61.3</cell><cell>47.2</cell><cell>37.1</cell><cell>80.8</cell><cell>63.3</cell><cell>75.1</cell></row><row><cell>CLIP</cell><cell>59.9</cell><cell>63.2</cell><cell>59.5</cell><cell>46.0</cell><cell>84.8</cell><cell>70.8</cell><cell>85.6</cell></row><row><cell>NegCLIP</cell><cell>81.0</cell><cell>71.0</cell><cell>91.0</cell><cell>86.0</cell><cell>85.4</cell><cell>75.3</cell><cell>87.3</cell></row><row><cell>AMR-NegCLIP</cell><cell>83.2</cell><cell>75.6</cell><cell>93.9</cell><cell>91.6</cell><cell>86.4</cell><cell>81.2</cell><cell>87.5</cell></row></table></figure>
<p xmlns="http://www.tei-c.org/ns/1.0">remainder with original token swap negative samples, are utilized for contrastive training. This approach ensures that a diverse range of negatives is maintained. The comparison of different probabilities is included in Section 5.3. For each image, one of the three nearest negative neighbors, determined by CLIP encoding, is sampled as the hard image.</p>
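The hard-image sampling described above (one of the three nearest neighbors under CLIP encoding) reduces to a top-k cosine-similarity lookup. A self-contained sketch with plain lists standing in for CLIP image embeddings; the `sample_hard_image` helper is an illustrative assumption, not the released training code:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sample_hard_image(embeddings, idx, k=3, seed=0):
    """Sample one of the k nearest neighbors of image `idx` (by cosine
    similarity of its embedding) to serve as the hard negative image."""
    sims = sorted(
        ((cosine(embeddings[idx], e), i)
         for i, e in enumerate(embeddings) if i != idx),  # skip the image itself
        reverse=True,
    )
    top_k = [i for _, i in sims[:k]]
    return random.Random(seed).choice(top_k)

# Toy 2-D embeddings: images 1, 2 and 4 point almost the same way as image 0.
embeddings = [[1.0, 0.0], [0.99, 0.01], [0.98, 0.02], [0.0, 1.0], [0.97, 0.03]]
hard_idx = sample_hard_image(embeddings, idx=0)  # one of 1, 2 or 4
```

Sampling from the top-k rather than always taking the single nearest neighbor keeps the hard images varied across epochs.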
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>ARO performance comparison of different strategies. † : results from <ref type="bibr" target="#b9">[10]</ref>, applying the semantic negative strategy; ‡ : results from <ref type="bibr" target="#b14">[15]</ref>, incorporating Scene Graph Prediction in training.</figDesc><table><row><cell></cell><cell cols="2">Visual Genome</cell><cell>Flickr30k</cell><cell>COCO</cell></row><row><cell></cell><cell>Relation</cell><cell>Attribution</cell><cell>Order</cell><cell>Order</cell></row><row><cell>CLIP</cell><cell>59.9</cell><cell>63.2</cell><cell>59.5</cell><cell>46.0</cell></row><row><cell>NegCLIP</cell><cell>81.0</cell><cell>71.0</cell><cell>91.0</cell><cell>86.0</cell></row><row><cell>AMR-NegCLIP</cell><cell>83.2</cell><cell>75.6</cell><cell>93.9</cell><cell>91.6</cell></row><row><cell>Semantic Negative †</cell><cell>79.0</cell><cell>77.8</cell><cell>-</cell><cell>-</cell></row><row><cell>CLIP-SGVL ‡</cell><cell>-</cell><cell>-</cell><cell>82.0</cell><cell>78.2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5</head><label>5</label><figDesc>Comparison of ARO performance before and after replacing a portion of the original negative samples with AMR generated negative samples.</figDesc><table><row><cell></cell><cell cols="2">Visual Genome</cell><cell>Flickr30k</cell><cell>COCO</cell><cell></cell></row><row><cell></cell><cell>Relation</cell><cell>Attribution</cell><cell>Order</cell><cell>Order</cell><cell>Average</cell></row><row><cell>CLIP</cell><cell>59.9</cell><cell>63.2</cell><cell>59.5</cell><cell>46.0</cell><cell>57.2</cell></row><row><cell>NegCLIP</cell><cell>81.0</cell><cell>71.0</cell><cell>91.0</cell><cell>86.0</cell><cell>82.3</cell></row><row><cell cols="6">Replace Ratio</cell></row><row><cell>10%</cell><cell>83.4</cell><cell>74.4</cell><cell>94.1</cell><cell>92.1</cell><cell>86.0</cell></row><row><cell>20%</cell><cell>82.6</cell><cell>76.0</cell><cell>92.9</cell><cell>90.3</cell><cell>85.4</cell></row><row><cell>30%</cell><cell>83.2</cell><cell>75.6</cell><cell>93.9</cell><cell>91.6</cell><cell>86.1</cell></row><row><cell>40%</cell><cell>83.8</cell><cell>74.8</cell><cell>91.3</cell><cell>88.3</cell><cell>84.5</cell></row><row><cell>50%</cell><cell>82.6</cell><cell>74.3</cell><cell>94.0</cell><cell>90.6</cell><cell>85.4</cell></row><row><cell>60%</cell><cell>81.2</cell><cell>75.1</cell><cell>91.5</cell><cell>87.6</cell><cell>83.9</cell></row><row><cell>70%</cell><cell>80.3</cell><cell>71.9</cell><cell>93.7</cell><cell>91.8</cell><cell>84.4</cell></row><row><cell>80%</cell><cell>80.2</cell><cell>71.2</cell><cell>93.2</cell><cell>91.5</cell><cell>84.0</cell></row><row><cell>90%</cell><cell>78.4</cell><cell>71.3</cell><cell>89.3</cell><cell>86.4</cell><cell>81.4</cell></row><row><cell>100%</cell><cell>75.0</cell><cell>69.4</cell><cell>83.4</cell><cell>80.9</cell><cell>77.2</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/openai/CLIP</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="12888" to="12900" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2207.00221</idno>
		<title level="m">Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Winoground: Probing vision and language models for visio-linguistic compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Thrush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bartolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ross</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="5238" to="5248" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">When and why vision-language models behave like bags-of-words, and what to do about it?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yuksekgonul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bianchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kalluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Hsieh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kembhavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Abstract meaning representation for sembanking</title>
		<author>
			<persName><forename type="first">L</forename><surname>Banarescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bonial</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Georgescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Griffitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Hermjakob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Knight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Palmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schneider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th linguistic annotation workshop and interoperability with discourse</title>
				<meeting>the 7th linguistic annotation workshop and interoperability with discourse</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="178" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Amr-da: data augmentation by abstract meaning representation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Shou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2022</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3082" to="3098" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Teaching structured vision &amp; language concepts to vision &amp; language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Doveh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arbelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Harary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Giryes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Feris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Panda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ullman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Karlinsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2657" to="2668" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06152</idno>
		<title level="m">Structureclip: Enhance multi-modal language representations with structure knowledge</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Abdelsalam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fancellu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Basioti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Bhatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Pavlovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fazly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.14862</idno>
		<title level="m">Visual semantic parsing: From images to abstract meaning representation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">One spring to rule them both: Symmetric amr semantic parsing and generation without a complex pipeline</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bevilacqua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Blloshmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="12564" to="12573" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7871" to="7880" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Investigating pretrained language models for graph-to-text generation</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schmitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI</title>
				<meeting>the 3rd Workshop on Natural Language Processing for Conversational AI</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="211" to="227" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mendelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Karlinsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arbelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Feris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06343</idno>
		<title level="m">Incorporating structured representations into pretrained vision &amp; language models using scene graphs</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
