<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Connecting Text and Images in News Articles using VSE++</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Abbhinav</forename><surname>Elliah</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">Sri Sivasubramaniya Nadar College of Engineering</orgName>
								<address>
									<settlement>Chennai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><surname>Mirunalini</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">Sri Sivasubramaniya Nadar College of Engineering</orgName>
								<address>
									<settlement>Chennai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Sri Sivasubramaniya Nadar College of Engineering</orgName>
								<address>
									<settlement>Chennai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Haricharan</forename><surname>Bharathi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">Sri Sivasubramaniya Nadar College of Engineering</orgName>
								<address>
									<settlement>Chennai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anirudh</forename><surname>Bhaskar</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science and Engineering</orgName>
								<orgName type="institution">Sri Sivasubramaniya Nadar College of Engineering</orgName>
								<address>
									<settlement>Chennai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Technology</orgName>
								<orgName type="institution">Sri Sivasubramaniya Nadar College of Engineering</orgName>
								<address>
									<settlement>Chennai</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Connecting Text and Images in News Articles using VSE++</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">EB5CEB7D047E74C1FB93D5A3F1838496</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Using a large dataset of headlines, excerpts, and related images, we examine the complex link between linguistic and visual aspects of news articles in this study. In MediaEval 2023, we are tasked with identifying patterns that explain the relationships between text and images while taking a number of variables into account. The text features were extracted using the BERT model, and the image features were extracted using the CNN model EfficientNet-b0. The extracted image and text features are then used to train the VSE++ model, which helps us establish the relationship between the text and the images. Our study, which places a strong emphasis on this model, attempts to clarify the intricate dynamics of the connection between text and images.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Online news stories in the digital age combine text and visuals to produce a dynamic and engaging read. Understanding the relationship between text and images allows for a more nuanced and detailed insight into the information being conveyed. It also aids fact-checking to verify the accuracy of news articles. Understanding the complex link between text and visuals in news items is not simple; deep learning transformer models can help us capture the relationship that exists between the text and the images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>According to Ali and Paccosi <ref type="bibr" target="#b0">[1]</ref>, Multimodal Understanding of Smells in Texts and Images (MUSTI) aims to analyze the relatedness of smells between digital text and image collections from the 17th to 20th century in a multilingual context, introducing a binary classification task to identify text-image pairs that refer to the same smell source and an optional sub-task to determine the specific smell sources.</p><p>Zhang and Lu <ref type="bibr" target="#b1">[2]</ref> introduce the Cross-Modal Projection Matching (CMPM) loss and Cross-Modal Projection Classification (CMPC) loss to enhance image-text matching. The CMPM loss minimizes the KL divergence between projection compatibility and normalized matching distributions for positive and negative samples. The CMPC loss categorizes vector projections with an improved norm-softmax loss, aiming to represent each class compactly. The approach demonstrates its superiority through extensive analysis and experiments on multiple datasets, addressing the challenge of accurately measuring similarity in real applications.</p><p>MediaEval'23: Multimedia Evaluation Workshop, February 1-2, 2024, Amsterdam, The Netherlands and Online. abbhinav2210396@ssn.edu.in (A. Elliah); miruna@ssn.edu.in (M. P); keerthick2210372@ssn.edu.in (K. V); haricharan2010267@ssn.edu.in (H. Bharathi); anirudh2010094@ssn.edu.in (A. Bhaskar); vithula2210417@ssn.edu.in (V. S)</p><p>Yin and Chen <ref type="bibr" target="#b2">[3]</ref> propose a method for precise image and text retrieval in complex multimodal environments. It utilizes improved feature extraction with two-dimensional principal component analysis (2DPCA) for images and an LSTM with word vectors for text, while interactive learning through a dual-modal CAE achieves accurate cross-modal retrieval. Experimental results on multiple datasets demonstrate superior performance, surpassing other methods on mean average precision (MAP) and precision-recall (PR) curves. Yu and Yao <ref type="bibr" target="#b3">[4]</ref> introduce a cross-modal Remote Sensing (RS) image retrieval method based on a Graph Neural Network (GNN) that addresses the challenge of information misalignment between query text and RS images. They propose a feature matching network with a GNN to learn feature interaction and association between text and RS images, employing text and RS image graph modules and a multi-head attention mechanism for effective fusion and matching; experimental results on standard datasets demonstrate competitive performance. The previous edition of the task, described by Lommatzsch and Kille <ref type="bibr" target="#b4">[5]</ref>, focused on developing innovative methodologies to accurately reassociate news articles with corresponding images, understanding the complexities of linking news texts and images, including the impact of AI-generated images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Approach</head><p>In our exploration of image-text relations in news articles, we employ cutting-edge methods. Convolutional Neural Networks (CNNs) such as EfficientNet extract image features, while Bidirectional Encoder Representations from Transformers (BERT) handles text. As the features of different texts have different lengths, we apply a padding process to ensure a cohesive connection between these features. Our ensemble approach combines the CNN, BERT, and the padding process to boost accuracy. The objective of the proposed work is to fine-tune the model's parameters to capture intricate visual patterns relevant to news articles. The EfficientNet model is trained on a large dataset of diverse images, and the pre-trained BERT model has been trained on a dataset consisting of English text as well as German texts translated into English. This step equips the model with a deep understanding of language nuances, enabling it to extract meaningful features from news article texts. The padding process aligns the extracted image and text features to ensure compatibility and enhances the coherence of the features. The proposed work uses the Visual-Semantic Embedding (VSE++) model for seamless integration of the image and text features extracted from the given datasets.</p></div>
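The padding step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `zero_pad_features` is a hypothetical helper, and the feature lengths are toy values (in practice EfficientNet-b0's pooled image feature is 1280-dimensional, while BERT token-level features vary with text length).

```python
def zero_pad_features(features, target_len):
    """Zero-pad (or truncate) variable-length feature vectors to a
    common length so text and image features can be aligned in a
    shared space."""
    padded = []
    for f in features:
        f = list(f)[:target_len]            # truncate if too long
        padded.append(f + [0.0] * (target_len - len(f)))  # pad if short
    return padded

# Toy BERT-style text features of unequal length, padded to length 8
text_feats = [[0.1] * 5, [0.2] * 3]
aligned = zero_pad_features(text_feats, 8)
print([len(v) for v in aligned])  # [8, 8]
```

After padding, every text feature has the same dimensionality, so it can be compared against (or projected alongside) the fixed-size image feature.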
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Implementation and Experiments</head><p>The features extracted from the datasets have been elaborated below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Feature Extractors</head><p>BERT employs a transformer-based architecture that facilitates bidirectional learning, allowing the model to capture contextual information from both preceding and subsequent words for improved language representation. EfficientNet-b0 features a baseline architecture with compound scaling, systematically increasing model depth, width, and resolution to achieve an optimal balance between computational efficiency and classification accuracy in image tasks. Fine-tuning both the image and text models' representations ensures multimodal integration, enhances their capacity to extract meaningful visual features, and adapts the models to the specific nuances of the dataset at hand, improving their effectiveness in capturing domain-specific semantic relationships.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Visual Semantic Embeddings</head><p>The integration of text and image representations is orchestrated through the Visual-Semantic Embedding (VSE++) model. Faghri and Fleet in <ref type="bibr" target="#b5">[6]</ref> present a novel technique, VSE++, for improving visual-semantic embeddings for cross-modal retrieval by incorporating hard negatives in the loss function, resulting in significant gains in retrieval performance, as demonstrated through experiments on the MS-COCO and Flickr30K datasets. VSE++ employs a multimodal architecture that learns joint embeddings for images and text and acts as a bridge to enhance the alignment of visual and semantic representations through positive instance pairs. This model encapsulates the essence of cross-modal understanding, enabling the system to discern semantic similarities between textual descriptions and corresponding images.</p></div>
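The hard-negative loss that distinguishes VSE++ from earlier embedding models can be written down compactly. The sketch below is an illustrative pure-Python rendering of the max-of-hinges (MH) loss from [6], not the authors' training code; `sim` is assumed to be a batch similarity matrix in which entry `sim[i][j]` scores image i against text j, with matching pairs on the diagonal.

```python
def vse_pp_loss(sim, margin=0.2):
    """Max-of-hinges triplet loss from VSE++: for each positive
    image-text pair sim[i][i], penalise only the hardest negative in
    the mini-batch, in both retrieval directions."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        # hardest mismatched text for image i (image -> text direction)
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # hardest mismatched image for text i (text -> image direction)
        hardest_img = max(sim[j][i] for j in range(n) if j != i)
        loss += max(0.0, margin - pos + hardest_text)
        loss += max(0.0, margin - pos + hardest_img)
    return loss

# Well-separated pairs: all hinges inactive, so the loss is zero
print(vse_pp_loss([[0.9, 0.2], [0.1, 0.8]]))  # 0.0
```

Focusing on the single hardest negative, rather than summing over all negatives, is what [6] reports as the main driver of the retrieval gains on MS-COCO and Flickr30K.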
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Methodology</head><p>We leverage BERT for textual representation, exploiting its bidirectional transformer to capture comprehensive language context. The resulting text features, crucial for tasks requiring nuanced understanding, are zero-padded to a uniform length.</p><p>On the visual front, we employ EfficientNet-b0, a computationally efficient yet potent CNN tailored for image classification. Its adeptness at capturing diverse feature levels in images is harnessed, and these features are resized to align with the dimensions of the text features for multimodal integration. The VSE++ model was then fine-tuned and trained on the joint embeddings, comparing the features obtained for the texts and corresponding images from the given datasets of English and German news articles, with the parameters adjusted for our datasets. The hyperparameters, learning rate and number of epochs, were set to 0.001 and 100 respectively, and the model was trained in batches of 100 text and image features. The model weights are saved to a PyTorch file and carried forward for the successive batches.</p></div>
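The batching scheme described above (100 paired text/image features per batch, with weights carried over between batches) can be sketched as follows; `batches` is a hypothetical helper, shown here on toy integer "pairs" rather than real feature tensors.

```python
def batches(pairs, batch_size=100):
    """Yield successive fixed-size batches of paired text/image
    features; the final batch may be smaller than batch_size."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

# 250 toy feature pairs -> three batches of 100, 100 and 50
sizes = [len(b) for b in batches(list(range(250)))]
print(sizes)  # [100, 100, 50]
```

In the actual pipeline, after each batch the model state would be saved (e.g. with PyTorch's `torch.save`) and reloaded before training on the next batch, so the weights accumulate across the whole dataset.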
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Analysis</head><p>The performance of the proposed architecture was evaluated using the metrics Match@N, Mean Reciprocal Rank (MRR), and Mean Recall@k.</p><p>In the evaluation of the retrieval system on these metrics, the performance of the three runs reveals noteworthy insights. On the English datasets, both runs exhibited commendable capabilities in identifying relevant results, with approximately 7% of relevant predictions within the top 100. However, one run outperformed its counterpart, showing higher mean recall values across various thresholds and a more favorable mean reciprocal rank (MRR) at 100, suggesting a superior ranking of relevant results. Conversely, the non-English (German) run demonstrated comparatively lower performance, with only 2.67% of relevant results identified within the top 100. Its MRR and recall values further indicated a reduced ability to retrieve pertinent information compared to the English runs.</p><p>This analysis underscores the nuanced performance of the retrieval model across diverse datasets. While the English runs exhibited robust performance metrics, the disparities observed on the German dataset highlight potential challenges in cross-lingual information retrieval. The results emphasize the importance of considering linguistic variations and dataset-specific characteristics in system evaluations.</p><p>In summary, these findings provide valuable insights for optimizing cross-modal retrieval systems, emphasizing the need for tailored strategies to enhance performance across diverse linguistic contexts, as measured by Match@N, Mean Reciprocal Rank, and Mean Recall@k.
</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Prediction flowchart used for the predicting the relationship between image and text using CNN, BERT and VSE++ Model.</figDesc><graphic coords="2,89.29,442.56,541.69,217.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 Performance Metrics Prediction File Metric Top 10 Top 50 Top 100 Matches MRR (at 100)</head><label>1</label><figDesc></figDesc><table><row><cell>run3/eng1.txt</cell><cell>Recall 0.00600 0.03533 0.06867</cell><cell>103/1500</cell><cell>0.00427</cell></row><row><cell>run3/eng2.txt</cell><cell>Recall 0.00600 0.03200 0.07200</cell><cell>108/1500</cell><cell>0.00281</cell></row><row><cell cols="2">run3/german.txt Recall 0.00167 0.01433 0.02667</cell><cell>80/1500</cell><cell>0.00102</cell></row></table></figure>
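The evaluation metrics in Table 1 can be computed from a query-by-candidate similarity matrix. The sketch below is an illustrative reimplementation, not the official MediaEval scorer: `retrieval_metrics` is a hypothetical helper that assumes the ground-truth item for query i sits at index i, and it reports the number of matches within the top N together with the MRR truncated at N.

```python
def retrieval_metrics(sim, top_n=100):
    """For each query row, rank candidates by similarity; count
    queries whose true item (index i) lands in the top N, and
    average the reciprocal ranks (0 beyond the cutoff)."""
    n = len(sim)
    matches, mrr_sum = 0, 0.0
    for i in range(n):
        # rank = number of candidates scoring strictly above the truth
        rank = sum(1 for j in range(n) if sim[i][j] > sim[i][i])
        if rank < top_n:
            matches += 1
            mrr_sum += 1.0 / (rank + 1)
    return matches, mrr_sum / n

sim = [[0.9, 0.1, 0.0],
       [0.8, 0.7, 0.1],
       [0.2, 0.1, 0.6]]
print(retrieval_metrics(sim, top_n=2))  # 3 matches; MRR averages 1, 1/2, 1
```

Recall@k in Table 1 follows the same pattern, dividing the number of top-k matches by the total number of queries (1500 per run here).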
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion and Outlook</head><p>In this competition, we built a model on the foundation and precedents established by previous work. However, we could not achieve the desired output due to time constraints and the unavailability of required resources caused by Cyclone Michaung.</p><p>We emphasise that combining similarity models from multiple methods produces the best results. Future work would ideally experiment further with different parameters, different base estimators, and different techniques.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">MUSTI: Multimodal understanding of smells in texts and images at MediaEval</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Paccosi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Menini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Mathias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pasquale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kiymet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Raphaël</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Erp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of MediaEval 2022 CEUR Workshop</title>
				<meeting>MediaEval 2022 CEUR Workshop</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep cross-modal projection learning for image-text matching</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the European conference on computer vision (ECCV)</title>
				<meeting>the European conference on computer vision (ECCV)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="686" to="701" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A cross-modal image and text retrieval method based on efficient feature extraction and interactive learning cae</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Programming</title>
		<imprint>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Text-image matching for cross-modal remote sensing image retrieval via graph neural network</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>You</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="812" to="824" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Newsimages: Connecting text and images in mediaeval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lommatzsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kille</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Özgöbek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D.-T</forename><surname>Dang-Nguyen</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes Proceedings of the MediaEval 2023 Workshop</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-02-01">1-2 February 2024</date>
		</imprint>
	</monogr>
	<note>and Online</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Faghri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Fleet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fidler</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.05612</idno>
		<title level="m">Vse++: Improving visual-semantic embeddings with hard negatives</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
