<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MADS: A Multi-modal Academic Document Segmentation Dataset for Smart Question Bank Management</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Utathya</forename><surname>Aich</surname></persName>
							<email>utathya.aich@cnh.com</email>
							<affiliation key="aff0">
<orgName type="institution">CNH Industrial ITC</orgName>
								<address>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Swarnendu</forename><surname>Ghosh</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Institute of Engineering &amp; Management</orgName>
								<orgName type="institution">University of Engineering &amp; Management</orgName>
								<address>
									<settlement>Kolkata</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tulika</forename><surname>Saha</surname></persName>
							<email>sahatulika15@gmail.com</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Liverpool</orgName>
								<address>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">MADS: A Multi-modal Academic Document Segmentation Dataset for Smart Question Bank Management</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4721D465625C28189A9BE1E6D53AB855</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Document Image Analysis</term>
					<term>Multi-modal Document Processing</term>
					<term>Text Classification</term>
					<term>Deep Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Most major academic institutes and organizations today conduct competitive examinations to assess the eligibility of students for admission or recruitment. With the rising number of participants, traditional preparation methods are no longer sufficient, and AI-enabled tutoring has become essential for such exams. One such area of application is a smart question bank management system. Although large volumes of competitive-exam questions exist in physical form, they are hard for automated systems to process visually, since they consist of several types of text and non-text elements such as numbers, equations and images alongside textual paragraphs. For this purpose, we propose MADS, a multi-modal academic document segmentation dataset consisting of images of documents containing heterogeneous questions from competitive exams such as GMAT, GRE, GATE, SAT and UGC-NET. These documents consist of textual paragraphs along with numbers, images and equations. The dataset comes with bounding box annotations in two popular formats, YOLO and PASCAL-VOC, to aid the development of efficient document segmentation algorithms. Additionally, benchmarks are provided for state-of-the-art deep learning implementations, namely Faster R-CNN and YOLO-v8. From an application point of view, the proposed dataset enables identifying different objects in an image so that it can later be used for semantic-relationship extraction and question answering applications, enhancing comprehension and personalized learning experiences and thus supporting the goal of providing quality education.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Competitive examinations are among the most commonly used tools for academic performance assessment. They are generally conducted to select candidates suitable for a specific branch of study or work. Several such exams have become popular at both national and international levels. Due to this increase in competition, students and teachers find it hard to optimize the preparation process using traditional methods, which often leads to depression amongst them <ref type="bibr" target="#b0">[1]</ref>. While e-documents are more suitable for automated systems, it is hard to find organized question banks or materials in electronic format. Hard copies of question banks are available, but they are difficult to process directly as text, since they contain a mixture of text, equations, images, numbers and so on. One of the major challenges with such mixed-medium documents is localizing and segmenting the appropriate textual and non-textual elements. All these components have text-like properties and can confuse standard OCR techniques. Solutions are even scarcer when it comes to answering queries over multi-modal data. This becomes especially prominent for document images, which represent data not as a sequence of Unicode characters but as pixels. To implement a truly multi-modal question answering system, it is essential to segment these various components from complex documents before advanced image processing tools can be applied. For this purpose, we propose "MADS", a multi-modal academic document segmentation dataset. In this work, we focus primarily on questions from national and international competitive exams such as GMAT, UGC-NET, GRE, GATE and SAT, covering a large variety of examinations catering to students of various fields. 
The images in these documents contain a mixture of equations, diagrams and numbers embedded within the body of the questions, along with multiple options to choose from. The proposed dataset comes with bounding box annotations corresponding to four classes, namely equations, diagrams, numbers and text, offering a transformative resource aligned with the Sustainable Development Goal of Quality Education. By meticulously annotating elements such as text, images, equations and numbers within question papers, this dataset lays the groundwork for advancing educational research and technology applications. Leveraging it enables the development of innovative tools and algorithms for enhancing teaching methodologies, personalized learning experiences and educational accessibility. Through the identification of text, images and equations, educational materials can be optimized for accessibility features such as text-to-speech conversion and alternative formats for students with disabilities. This ensures that all learners, including those with visual impairments or learning disabilities, can access educational content on an equal basis. The availability of the proposed dataset allows for intelligent tutoring systems and question-answering algorithms that promote a deeper understanding of educational concepts. Active participation and sustained engagement in the learning process can be achieved through immediate feedback and adaptive learning pathways.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions :</head><p>The key contributions of this work are as follows: (i) to establish the problem statement for multi-modal academic document image segmentation and its future applications; (ii) to provide a challenging dataset of multi-modal document images consisting of questions from various types of competitive examinations; (iii) to provide the necessary annotations for document image segmentation into four classes, namely equations, numbers, images and text; and (iv) to provide benchmarks using state-of-the-art detection algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>There have been previous approaches to managing question banks and exam protocols through AI-based technologies <ref type="bibr" target="#b1">[2]</ref>. However, most of these approaches deal with already existing electronic question banks <ref type="bibr" target="#b2">[3]</ref>. Little work exists that can automatically process the large volumes of question banks already available in printed form, such as previous years' question papers, study materials, educational magazines, and so on. However, there have been several applications of computer vision to multi-modal documents from other domains <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. Some of these approaches focus primarily on text and non-text separation in various scenarios <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. Among multi-modal text datasets, there are applications in multiple areas that present challenges similar to our proposed domain. The Tobacco-3482 <ref type="bibr" target="#b9">[10]</ref> dataset consists of document images belonging to 10 classes such as forms, letters, resumes, memos, and so on. The RVL-CDIP dataset <ref type="bibr" target="#b10">[11]</ref> consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Multi-label classification has been performed on academic papers to extract components such as titles and keywords <ref type="bibr" target="#b11">[12]</ref>. Moreover, some multi-modal document image datasets dealing with mathematical equations <ref type="bibr" target="#b12">[13]</ref> or geometry <ref type="bibr" target="#b13">[14]</ref> problems have also been explored. 
In terms of exam-related problems, similar work has been done in specific subject groups such as the social or natural sciences <ref type="bibr" target="#b14">[15]</ref> or medical entrance exams <ref type="bibr" target="#b15">[16]</ref>. These methods include implementations that address multilingual Q&amp;A problems as well as multiple-choice questions. However, a thorough survey shows a lack of datasets that operate in unrestricted domains and provide fundamental annotations of multi-modal content. Furthermore, the proposed dataset provides samples that do not have Unicode representations, making it equivalent to digitally scanned print media.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>Due to the unavailability of a multi-modal question bank dataset through which one can segregate the different textual and non-textual elements of a given question via document segmentation, we propose "MADS" and discuss its creation below. The sample dataset is made publicly available by the authors under a Creative Commons (CC) licence<ref type="foot" target="#foot_0">1</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data Collection</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data Annotation</head><p>Next, the task was to annotate the images (typically questions) to extract the relevant information, namely question text, images, numbers and equations. All the sample questions were uploaded to an open-source annotation tool, Label-Studio<ref type="foot" target="#foot_1">2</ref>, for creating bounding boxes. Three annotators from the authors' affiliation were asked to draw bounding boxes for these samples through this tool. The annotators were briefed on and shown a demonstration of the task, and were initially asked to annotate 10 samples each for the four categories present in the images. These samples were then checked by the authors and any errors were resolved. The annotators were then provided with all the remaining samples, divided equally among the three, for annotation. On average, there was at least one bounding box each for the image and text classes in every sample of the dataset. To create the gold-standard annotated dataset, we required the Intersection over Union (IoU) score <ref type="bibr" target="#b24">[27]</ref> between the annotated boxes to be at least 80% and, in addition, Cohen's kappa to be greater than 90% for acceptance of a bounding box with its class label. The Cohen's kappa score measures the agreement between the labels assigned by the annotators and the authors verifying the annotations.</p></div>
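The overlap part of the acceptance criterion above can be sketched as follows; this is a minimal illustration assuming boxes in (x_min, y_min, x_max, y_max) pixel coordinates, and the helper names are ours, not part of the released tooling (the kappa check on class labels is computed separately over all annotations):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def accept(box_a, box_b, iou_threshold=0.8):
    """Gold-standard acceptance rule on box overlap: IoU of at least 80%."""
    return iou(box_a, box_b) >= iou_threshold
```

For example, two 10x10 boxes shifted by one pixel horizontally overlap with IoU 0.9 and would be accepted, while boxes that barely touch would not.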
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">MADS</head><p>MADS currently comprises 230 question samples annotated with bounding boxes for the presence of four categories of information, namely question text, image, number and equation. An annotated sample from MADS is shown in Figure <ref type="figure" target="#fig_0">1</ref>. As is visible, it contains a mixture of equations, text, images and numbers, and it is challenging for machines to identify these parts of an image easily. Some questions contain numbers, text and equations on the same line; some samples include an image and an equation in the same place; and in some images the questions are in a two-column format, which makes the regions harder to segment. It is indeed difficult to identify and differentiate amongst these elements, and through MADS we aim to tackle such diverse situations. The largest contribution to the dataset comes from GATE questions, which constitute 32.3% of the whole, followed by UGC-NET, GMAT, GRE and SAT. The distribution of the dataset is shown in Figure <ref type="figure" target="#fig_1">2a</ref>. The dataset exhibits a predominance of text, comprising 5,536 bounding boxes, or 75.5% of the annotations. The image class has the fewest bounding boxes, 191, or 3% of the dataset. The class-based statistics are depicted in Figure <ref type="figure" target="#fig_1">2b</ref>.</p></div>
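Since the annotations are released in both YOLO and PASCAL-VOC formats (as noted in the abstract), converting between the two conventions is a routine step when training on the dataset. The sketch below assumes the standard conventions only (YOLO: normalised x_center, y_center, width, height; VOC: absolute x_min, y_min, x_max, y_max) and is not part of the dataset tooling:

```python
def yolo_to_voc(xc, yc, w, h, img_w, img_h):
    """Convert a normalised YOLO box to absolute PASCAL-VOC corners."""
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return x_min, y_min, x_max, y_max

def voc_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert absolute PASCAL-VOC corners to a normalised YOLO box."""
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)
```

The two functions are exact inverses of each other, so annotations can be round-tripped between formats without loss.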
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head><p>The aim of MADS is to facilitate the training of models that identify the different categories present in a given question image. The dataset facilitates identifying different objects in an image, which can later be used for semantic-relationship extraction and question answering. A model trained on MADS should then be able to identify and segregate the different pieces of information present in a question for smart question bank management, and facilitate future research in this area. In this section, we benchmark MADS using different state-of-the-art vision models for bounding box detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Benchmark Setup</head><p>We benchmark MADS using two state-of-the-art vision models as follows:</p><p>• YOLO-v8<ref type="foot" target="#foot_2">3</ref>: YOLO-v8 is an advancement of the YOLO <ref type="bibr" target="#b25">[28]</ref>  The pre-trained YOLO-v8 is fine-tuned and Faster R-CNN is trained on MADS to benchmark the dataset using state-of-the-art vision models for the task of detecting useful information in the form of bounding boxes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Implementation Details</head><p>MADS is divided into train and test sets with a ratio of 85:15. We conducted each experiment five times and report the average results for the different models. The vanilla YOLO-v8<ref type="foot" target="#foot_4">5</ref> medium model, with 25.9 million parameters, is fine-tuned on MADS. The vanilla Faster R-CNN model<ref type="foot" target="#foot_5">6</ref> with a ResNet-50 backbone is trained on MADS. All parameters are set to their default values. The learning rate is set to 0.001 and the batch size to 64. The number of anchors is set to 3. As there are 4 classes for detection in MADS, we have 4 output neurons. The confidence threshold is set to 0.25 by default. YOLO-v8 uses LeakyReLU as its activation function. These parameters may be tuned in the future to obtain better performance. We use two evaluation metrics, Intersection over Union (IoU) and mean Average Precision (mAP), to benchmark the performance of the models.</p><p>Evaluation Metrics. The IoU and mAP metrics are explained as follows:</p><p>• IoU Score: This metric is commonly used to evaluate the performance of object detection algorithms. It measures the overlap between the predicted bounding box and the ground truth. The IoU is calculated using the following formula:</p><formula xml:id="formula_0">𝐼𝑜𝑈 = Area_of_Overlap / Area_of_Union (1)</formula><p>where Area_of_Overlap is the area common to both the predicted and ground-truth regions, and Area_of_Union is the total area covered by both. Our experiments are evaluated at the same IoU thresholds used in COCO: the predicted annotations are evaluated at IoU thresholds of 0.5 and 0.9, respectively.</p><p>• Mean Average Precision (mAP): mAP is a commonly used metric to evaluate the performance of object detection or information retrieval systems. 
It provides a single scalar value at each IoU threshold. We first find the average precision of each class, then average over all classes to obtain the mAP.</p></div>
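The mAP computation described above can be sketched as follows. This is a simplified, non-interpolated average precision; COCO-style evaluation additionally interpolates the precision envelope, and the function names here are illustrative rather than taken from any evaluation library:

```python
def average_precision(scored_hits, num_gt):
    """AP for one class: scored_hits is a list of (confidence, is_tp)
    for every prediction; num_gt is the number of ground-truth boxes.
    Computes the area under the raw precision-recall curve."""
    scored_hits = sorted(scored_hits, key=lambda p: -p[0])  # by confidence
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in scored_hits:
        tp, fp = tp + is_tp, fp + (not is_tp)
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # area of this recall step
        prev_recall = recall
    return ap

def mean_average_precision(per_class):
    """mAP: mean of per-class APs, as described in the text.
    per_class is a list of (scored_hits, num_gt) pairs, one per class."""
    aps = [average_precision(hits, n) for hits, n in per_class]
    return sum(aps) / len(aps)
```

Whether a prediction counts as a true positive is decided by matching it to a ground-truth box at the chosen IoU threshold (0.5 or 0.9 here), which is why the mAP values differ between the two thresholds.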
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Analysis</head><p>We average the results of all experiments for each model to obtain the final results on the MADS dataset. Based on the predicted IoU score, we set thresholds of 50% and 90% to record the mAP score. At these thresholds, we compute metrics such as accuracy, precision and recall to determine each model's performance. Table <ref type="table" target="#tab_1">1</ref> depicts the accuracy of the YOLO-v8 and Faster R-CNN models trained on MADS. As observed, YOLO-v8 outperforms Faster R-CNN by a significant margin of about 15% in terms of accuracy when the IoU threshold is 50%. Similarly, when the IoU threshold is set to 90%, YOLO-v8 shows about a 3% improvement over Faster R-CNN. On average, YOLO-v8 showed a standard deviation of ±0.5 and ±1 in overall accuracy at IoU thresholds of 50% and 90% respectively, while Faster R-CNN showed standard deviations of ±0.7 and ±2.6 for the same. Table <ref type="table" target="#tab_1">1</ref> also provides a benchmark of the per-class precision and recall of the different models at the 50% and 90% thresholds. Experimentally, the class-level precision of Faster R-CNN at an IoU threshold of 50% has a standard deviation of ±3.6 for equation, ±2.2 for image, ±1.9 for number and ±0.9 for text. For YOLO-v8 at the 50% threshold, the class-level precision has a standard deviation of ±1.5 for equation, ±1.1 for image, ±1.08 for number and ±1.2 for text. YOLO-v8 outperforms Faster R-CNN by a narrow margin of about 3% when the IoU threshold is 90%. 
At the 90% IoU threshold, YOLO-v8 has a standard deviation of ±2.1 for equation, ±1.9 for image, ±2.9 for number and ±2.01 for text, whereas Faster R-CNN shows ±3.7 for equation, ±2.4 for image, ±4.7 for number and ±3.5 for text. YOLO-v8's superior performance can be attributed to the fact that Faster R-CNN uses a two-stage detector during training while YOLO-v8 uses a single-shot detector. This gives YOLO-v8 the advantage of looking at the whole image at once, whereas Faster R-CNN uses region proposals to localize objects within the image. We also report the precision and recall for individual class labels. The mAP score of Faster R-CNN is 59.6% at IoU50 and 88.37% at IoU90; YOLO-v8 has a mAP score of 84.25% at IoU50 and 86.15% at IoU90. The text tag appears to be the easiest to identify, as the dataset has the highest number of text annotations. The prediction of the YOLO-v8 model for the sample in Figure <ref type="figure" target="#fig_0">1</ref> is shown in Figure <ref type="figure" target="#fig_2">3</ref>. With the increase in IoU threshold from 50% to 90%, the models classify the different tags more accurately. At the 50% threshold, more bounding boxes are identified, but more of them are misclassified. YOLO-v8 fails to improve on number tags despite an increase in precision for the other tags when the threshold is raised from 50% to 90%; here, Faster R-CNN outperforms YOLO-v8 in identifying number tags at an IoU threshold of 90%. Although YOLO-v8 performs better than Faster R-CNN in almost every scenario, challenges do exist. Both algorithms have difficulty distinguishing equations from images when the two are mixed; isolating such instances while preserving their semantic relationships poses a considerable challenge. 
Some challenging image snippets are shown in Figure <ref type="figure" target="#fig_3">4</ref>. The models tend to have difficulty segregating equations and images. These issues can be further mitigated by fine-tuning the hyperparameters. The size of the dataset also needs to be scaled up (an ongoing effort) to achieve better performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>In this paper, we established a novel problem statement for multi-modal academic document image segmentation and steered the discussion toward its future applications. Due to the unavailability of any existing dataset relevant to the task, we propose MADS, a dataset consisting of questions from various types of competitive examinations, together with gold-standard annotations for extracting information from these questions through the task of bounding box detection. We benchmark MADS with the help of several state-of-the-art vision models. The dataset exhibits a predominance of text over the other object classes, revealing a bias in the performance of the base algorithms towards text detection. Challenges arise when labels are annotated within the bounding boxes of text. Since text characters are distributed in regular horizontal and vertical runs, meaningful segments can be enclosed in rectangular bounding boxes. To address this bias, fine-tuning strategies can be implemented to improve the accuracy for the other class labels. This presents an intriguing area for future research, as overcoming these complexities would contribute significantly to the advancement of the field. The primary goal of releasing this dataset is to spur a line of automated teaching and learning methods to aid students appearing for such competitive exams. In its first iteration, the dataset provides the opportunity to digitize existing question banks and annotate them in the process. At this point, the dataset primarily focuses on segregating text, equations, figures and numbers. Finer segregation may be incorporated in future versions of the dataset. 
Future iterations will focus on increasing the volume of the dataset, broadening the domain, embedding multi-modal questions for processing in large language models and vision-language models, integrating GPT-based services to retrieve solutions to questions, generating personalized mock tests, and so on. We anticipate that this dataset will drive novel research contributions and applications in the field of smart question bank management and education in general.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Samples of MADS from different sources of examination: top row -original question, bottom row -annotated sample question. Red ='Text', Orange ='Number', Yellow ='Image', Blue ='Equation'</figDesc><graphic coords="4,89.29,84.19,416.71,286.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Statistics from MADS: (a) Distribution of different source representation, (b) Distribution of different class labels</figDesc><graphic coords="5,291.39,86.26,175.01,98.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Predicted samples from YOLO-v8 for the images in Figure 1</figDesc><graphic coords="8,89.29,123.25,416.69,159.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Challenging image snippets from MADS</figDesc><graphic coords="9,130.96,84.19,333.36,216.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>model. The advanced model is developed by Ultralytics and achieves high accuracy on the COCO dataset 4 . It is an anchor-free model, which means it predicts the centre of an object rather than an offset from a known anchor box, making it more robust to noise and occlusions than many other available models. It also adopts an improved backbone and detection head, a CIoU-based regression loss, and a refined label-assignment strategy during training. Faster R-CNN was developed by Ren, He, Girshick and Sun. It extends Fast R-CNN, which introduced the ROI pooling layer, by adding a Region Proposal Network that shares convolutional features with the detection stage, forming a single unified trainable network in which detection proceeds in two stages (proposal and classification). Faster R-CNN does not need much disk storage compared to R-CNN, as it does not cache the extracted features.</figDesc><table /><note>• Faster R-CNN:<ref type="bibr" target="#b26">[29]</ref> </note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Average class-wise precision and recall of Faster R-CNN and YOLO-v8 at IoU thresholds of 50% and 90% for box overlap</figDesc><table><row><cell>Model</cell><cell>Accuracy</cell><cell></cell><cell cols="2">Average Precision</cell><cell></cell><cell></cell><cell cols="2">Average Recall</cell><cell></cell></row><row><cell></cell><cell></cell><cell cols="3">Equation Image Number</cell><cell>Text</cell><cell cols="3">Equation Image Number</cell><cell>Text</cell></row><row><cell cols="2">Faster RCNN @ IoU50 79.1%</cell><cell>47.6%</cell><cell>35.9%</cell><cell>64.7%</cell><cell>90.2%</cell><cell>32.1%</cell><cell>48.3%</cell><cell>71.7%</cell><cell>91.2%</cell></row><row><cell>YOLO-v8 @ IoU50</cell><cell>93.7%</cell><cell>73.4%</cell><cell>80.5%</cell><cell>86.6%</cell><cell>96.5%</cell><cell>69.2%</cell><cell>77.1%</cell><cell>92.02%</cell><cell>96.5%</cell></row><row><cell cols="2">Faster RCNN @ IoU90 94.5%</cell><cell>63.8%</cell><cell>97.5%</cell><cell>95.3%</cell><cell>96.9%</cell><cell>97.46%</cell><cell>100%</cell><cell>49.9%</cell><cell>96.9%</cell></row><row><cell>YOLO-v8 @ IoU90</cell><cell>97.1%</cell><cell>98.3%</cell><cell>97.5%</cell><cell>50.8%</cell><cell>98.02%</cell><cell>66.3%</cell><cell>88.9%</cell><cell>97.6%</cell><cell>98.7%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/MADS-dataset/MADS_Dataset_official</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://labelstud.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://docs.ultralytics.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://cocodataset.org/#home</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://github.com/ultralytics/ultralytics?tab=readme-ov-file</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://pypi.org/project/detecto/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>Dr. Swarnendu Ghosh is thankful for the infrastructure support from IEM Centre of Excellence for Data Science and the Innovation &amp; Entrepreneurship Development Cell, IEM Kolkata.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Assessment of depression, anxiety and stress among students preparing for various competitive exams</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shrivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rajan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Healthcare Sciences</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="50" to="72" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A systematic review of automatic question generation for educational purposes</title>
		<author>
			<persName><forename type="first">G</forename><surname>Kurdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Parsia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Sattler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Al-Emari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Artificial Intelligence in Education</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="121" to="204" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Automatic generation of question paper from user entered specifications using a semantically tagged question repository</title>
		<author>
			<persName><forename type="first">G</forename><surname>Nalawade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ramesh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Eighth International Conference on Technology for Education (T4E)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="148" to="151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Visual and textual deep feature fusion for document image classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bakkali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ming</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rusiñol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops</title>
				<meeting>the IEEE/CVF conference on computer vision and pattern recognition workshops</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="562" to="563" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Intelligent indexing and semantic retrieval of multimodal documents</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Srihari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="245" to="275" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Multimodality and genre: A foundation for the systematic analysis of multimodal documents</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bateman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Text/non-text image classification in the wild with convolutional neural networks</title>
		<author>
			<persName><forename type="first">X</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="page" from="437" to="446" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Multi scale mirror connection based encoder decoder network for text localization</title>
		<author>
			<persName><forename type="first">K</forename><surname>Dutta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Basak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kundu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nasipuri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">135</biblScope>
			<biblScope unit="page" from="64" to="71" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Image/text relations and intersemiosis: Towards multimodal text description for multiliteracies education</title>
		<author>
			<persName><forename type="first">L</forename><surname>Unsworth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 33rd IFSC: International Systemic Functional Congress</title>
				<meeting>the 33rd IFSC: International Systemic Functional Congress</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
		<respStmt>
			<orgName>Pontificia Universidade Catolica de Sao Paulo</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Learning document structure for retrieval and classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Doermann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012)</title>
				<meeting>the 21st International Conference on Pattern Recognition (ICPR2012)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1558" to="1561" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Evaluation of deep convolutional nets for document image classification and retrieval</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Harley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ufkes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">G</forename><surname>Derpanis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2015 13th International Conference on Document Analysis and Recognition (ICDAR)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="991" to="995" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Multi-label classification of research articles using word2vec and identification of similarity threshold</title>
		<author>
			<persName><forename type="first">G</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Usman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Afzal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sulaiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shahid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">21900</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.03874</idno>
		<title level="m">Measuring mathematical problem solving with the MATH dataset</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Solving geometry problems: Combining text and diagram interpretation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Malcolm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 conference on empirical methods in natural language processing</title>
				<meeting>the 2015 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1466" to="1476" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Hardalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zlatkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2011.03080</idno>
		<title level="m">EXAMS: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Umapathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sankarasubbu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Health, Inference, and Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="248" to="260" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://eduaims.in/gmat-sample-paper-pdf/" />
		<title level="m">GMAT sample question paper 2023 with 100 Q and A | EduAims</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">Hank</forename><surname>Walker</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">M</forename><surname>Prep</surname></persName>
		</author>
		<imprint>
			<biblScope unit="volume">5</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<ptr target="https://satsuite.collegeboard.org/media/pdf/sat-practice-test-9.pdf" />
		<title level="m">SAT Study Guide 2020 - Practice Test 9</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<ptr target="https://satsuite.collegeboard.org/media/pdf/sat-practice-test-10.pdf" />
		<title level="m">SAT Study Guide 2020 - Practice Test 10</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<ptr target="https://satsuite.collegeboard.org/media/pdf/sat-practice-test-3.pdf" />
		<title level="m">SAT Study Guide 2020 - Practice Test 3</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m">UGC NET previous question papers</title>
		<ptr target="https://www.ugcnetonline.in/previous_question_papers.php" />
		<imprint/>
		<respStmt>
			<orgName>University Grants Commission - NET</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m">GATE old question papers</title>
		<ptr target="https://gate.iitkgp.ac.in/old_question_papers.html" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note>official site</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<ptr target="https://en.wikipedia.org/wiki/Jaccard_index" />
		<title level="m">Jaccard index - Wikipedia</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">You only look once: Unified, real-time object detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Redmon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Divvala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<idno>CoRR abs/1506.02640</idno>
		<ptr target="https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf" />
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Faster r-cnn: Towards real-time object detection with region proposal networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
