<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An Integrated System for Interacting with Multi-Page Scholarly Documents</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lorenzo</forename><surname>Massai</surname></persName>
							<email>lorenzo.massai@unifi.it</email>
							<affiliation key="aff0">
<orgName type="institution">DINFO - University of Florence</orgName>
								<address>
									<addrLine>via S. Marta, 3</addrLine>
									<settlement>Firenze</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simone</forename><surname>Marinai</surname></persName>
							<email>simone.marinai@unifi.it</email>
							<affiliation key="aff0">
<orgName type="institution">DINFO - University of Florence</orgName>
								<address>
									<addrLine>via S. Marta, 3</addrLine>
									<settlement>Firenze</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">An Integrated System for Interacting with Multi-Page Scholarly Documents</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">5020E977CD439ACB2FA33EF313248949</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural language processing</term>
					<term>document layout analysis</term>
					<term>conversational agents</term>
					<term>retrieval augmented generation</term>
					<term>large language models</term>
					<term>question answering</term>
					<term>document understanding</term>
					<term>linked data</term>
					<term>scholarly document processing</term>
					<term>multi-modal feature extraction</term>
					<term>text mining</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this work we present a preliminary version of a comprehensive interface that supports users in interacting with scholarly documents, enabling multi-layered exploration and offering deeper insights by integrating diverse features and contextual information. By bridging diverse information, our work pursues the identification, characterization, and linking of visual elements to semantic and context data, leveraging large language models for interoperability. Recent advances in retrieval augmented generation are also exploited to address some language model limitations, allowing such models to access latent information from document representations such as graph and vector embeddings.</p><p>The system under development performs an analysis of input documents and enables the extraction of visual and semantic features, making them accessible in a comprehensive framework. The association of structural information with visual data allows formal analysis of documents and is exploited in our model to enhance visual extraction, performing a novel ontology-based constraint violation detection. The information extracted through this framework is semantically explorable, providing access to the document structure, which can be exploited in many applications like question answering and document understanding.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years the digital analysis of documents has gained attention due to the massive growth of online media publishing and to the large availability of shared knowledge. Narrowing the field to the scientific literature, readers understand document meaning by exploiting different kinds of contextual information, such as the layout and the geometric properties of the elements that accompany the text. Scientific literature is typically shared in unstructured formats like PDF or images, and getting automatic access to different kinds of knowledge requires making several disjoint queries, linking data from different sources while keeping the original context.</p><p>This paper aims at extending the research field of Visual Document Understanding (VDU) in the scientific literature domain through the association of text semantics with visual features, merging them in a shared structure which allows multi-modal exploration. The main challenges addressed in this work can be found in the following areas.</p><p>Semantic segmentation. The association of semantics with visual data is a key research problem in computer vision, including tasks like object recognition, image captioning, and image segmentation. In document analysis the goal is to understand contents by extracting the geometric properties of visual elements such as figures, tables, and text, and of layout elements such as columns, footnotes, and titles, classifying them into semantic categories. Most document understanding systems are limited to text blocks and figure/text classification, lacking contextual information and domain-specific recognition (e.g. listings, formulas, and chemical structures). The scientific literature, with its variety of visual and text data, is particularly suitable for semantic segmentation and can be used to estimate associations between representations. However, these representations are limited by their lack of interoperability; moreover, relying on images restricts the analysis to one page. When considering the whole document, the problem becomes much more complex, since inter-page relations and semantic regions spanning multiple pages must be considered.</p><p>Semantic integration. The integration of different document attributes can be pursued by merging the extracted information into shared structures, either for retrieving information about visual and layout elements in the document, or to get context information about the publication such as the author and the research field. The presence of a formal structure helps maintain and extend context; relations can also be exploited to identify structural constraint violations like overlapping layout regions and to allow category-based searches. To achieve such awareness, multiple layers of the same data have to be considered and an exhaustive semantic characterization of the entities is necessary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Layer interoperability.</head><p>Recent trends for navigating different layers of information go towards intelligent agents which are aware of the subject being asked about, of its context, and of the context of who asks. Such agents are able to understand questions and to provide coherent answers spanning different layers of information, adapting solutions and recommendations as the conversation evolves and also learning from the dialogue itself. The dependency of Visual Question Answering systems on document images results in limited awareness of the whole document and in the lack of contextual information or domain-specific recognition. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. One of the goals of this work is to link different layers of information to visual media across the whole document and to make them interoperable through conversational agents.</p><p>This paper presents original contributions that advance the fields of analysis and interaction with scholarly documents. By integrating different media representations, this work pursues the enhancement of document understanding and interaction, addressing specific challenges in document processing. In particular, our main achievements can be identified in:</p><p>• building a comprehensive interface to allow multi-page interaction with scholarly documents; • performing explainable association of layout information to visual and text data; • enhancing detection of visual recognition anomalies by exploiting semantic constraints; • allowing multi-layer interoperability through large language models.</p><p>These contributions enhance the accessibility, explainability, and interoperability of scholarly document analysis, enabling semantic processing and navigation of academic papers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>The review of the state of the art focuses on three distinct but related research areas: document segmentation, semantic linking, and visual question answering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Document segmentation</head><p>Various approaches exist for identifying layout elements in visually structured documents, typically targeting specific types like tables <ref type="bibr" target="#b0">[1]</ref>, formulas <ref type="bibr" target="#b1">[2]</ref>, bibliographic references <ref type="bibr" target="#b2">[3]</ref> and many others. Most approaches rely on OCR and TEI-XML conversion; for multi-page documents, current methods like HRDoc <ref type="bibr" target="#b3">[4]</ref> use Mask R-CNN and language models to extract semantic regions and their relations.</p><p>Regarding the conversion from PDF to a more structured format such as TEI-XML, Grobid <ref type="bibr" target="#b2">[3]</ref> is considered one of the ten best tools for extracting bibliography data from document images <ref type="bibr" target="#b4">[5]</ref>; it allows multi-page analysis and is also capable of extracting other layout elements. Whole-document analysis increases the problem complexity; to this end, some efforts have been made to extract relations between pages in the form of triples <ref type="bibr" target="#b5">[6]</ref>.</p><p>Several multi-purpose datasets exist in the area of scholarly document understanding, focusing on layout analysis, text and visual element extraction, and document structure identification. The largest datasets have to deal with the multiplicity of layouts present in scholarly articles, addressing the complications of storing different data types into suitable structures. For this reason the most extended sources of information rely on flexible data containers like the XML and JSON formats, which allow enough versatility for managing such a variety of descriptive data. Among recent datasets for scholarly document layout analysis, PubLayNet <ref type="bibr" target="#b6">[7]</ref> and DocBank <ref type="bibr" target="#b7">[8]</ref> are considered the most relevant, although they exhibit limited variability in contents and layout.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Semantic linking</head><p>There are technical and pragmatic reasons to pursue abstract representations of knowledge with Linked (Open) Data and ontologies. Using natural language processing and computer vision strategies to obtain searchable content does not ensure the maintenance of the visual or logical structure of the original data, which is essential for data context analysis and is necessary to perform structured queries and inference. The definition of a structure capable of hosting data extracted from raw sources makes it possible to keep the context and easily extend it, exploiting relations which exist among data but are not explicitly declared.</p><p>The most encouraging effort in the direction of a unified structure for modeling scholarly documents can be found in the Semantic Publishing And Referencing (SPAR) ontologies project <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b9">[10]</ref>, which includes several ontologies that are depicted in Figure <ref type="figure" target="#fig_0">1</ref>. The SPAR ontologies integrate models such as the Document Components Ontology (DoCO) <ref type="bibr" target="#b10">[11]</ref> which, in turn, includes pattern ontologies, discourse element ontologies, bibliographic resource ontologies, citation ontologies <ref type="bibr" target="#b11">[12]</ref>, and many others describing different aspects of scholarly documents.</p><p>The Document Components Ontology is composed of a rhetorical and a structural layer: rhetorical classes describe logical entities such as references, bibliographic references, captions, introduction, materials, methods, results, related work, and future work. The structural layer links rhetorical elements with structural components like title, section titles, paragraphs, footnotes, tables, figures, captioned boxes, figure boxes, lists, bibliographic reference list, front matter, body matter, back matter, chapters, sections, bibliography, and abstract. Each class defines semantic relations with other classes; e.g., the class Sentence includes DiscourseElement when it is found with the attribute inline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Visual question answering</head><p>Visual Question Answering (VQA) represents the main point of contact between the communities of natural language processing and computer vision. Technologies such as conversational agents and chatbots <ref type="bibr" target="#b12">[13]</ref> are suitable for this purpose. These technologies can interface with neural networks and ontologies <ref type="bibr" target="#b13">[14]</ref>, exploiting functionalities like graph reasoning <ref type="bibr" target="#b14">[15]</ref> to extend the context. Question answering systems can be integrated and trained to respond to questions on both visual and contextual information. Retrieval Augmented Generation (RAG) <ref type="bibr" target="#b15">[16]</ref> can further enhance these capabilities by combining generation with retrieval over custom knowledge bases to provide more accurate answers.</p><p>Toolformer <ref type="bibr" target="#b16">[17]</ref> and KnowledGPT <ref type="bibr" target="#b17">[18]</ref> integrate knowledge bases into Large Language Models (LLMs) with program-of-thought prompting, enabling questions that require broader contextual knowledge. Effective applications of RAG to scholarly articles can be found in ChatDOC<ref type="foot" target="#foot_0">1</ref> and in PaperQA <ref type="bibr" target="#b18">[19]</ref>, which describe RAG agents that can answer scientific questions. Document images pose distinct challenges due to their spatially organized elements and the combination of visual and textual information. To this end LayoutLM <ref type="bibr" target="#b19">[20]</ref> introduces 2D position embeddings, merging visual and text embeddings.</p><p>The main limitation of current research in scholarly document VQA lies in its reliance on page images, restricting the analysis to single pages and disregarding the semantic context. Some efforts have been made in this direction <ref type="bibr" target="#b20">[21]</ref>. VQA datasets supporting multi-page documents are hard to find; among the most recent and comprehensive are the MP-DocVQA dataset <ref type="bibr" target="#b21">[22]</ref>, the GRAM dataset <ref type="bibr" target="#b22">[23]</ref>, and the DUDE dataset <ref type="bibr" target="#b23">[24]</ref>. The DUDE dataset includes a wide range of document types and sources, covering diverse topics and layouts, and fully supports multi-page analysis, although it offers limited layout semantics. The lack of valuable multi-page datasets can also be addressed through document generation <ref type="bibr" target="#b24">[25]</ref>. The most comprehensive resource for scholarly document analysis, to the best of our knowledge, is the Semantic Scholar Open Research Corpus (S2ORC) dataset <ref type="bibr" target="#b25">[26]</ref>. S2ORC comprises 8.1M open-access PDF-parsed papers across different academic disciplines and offers full reproducibility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">System architecture</head><p>The proposed architecture (Figure <ref type="figure" target="#fig_1">2</ref>) is aimed at extracting different layers of information from multi-page scholarly articles exploiting state-of-the-art tools; future work is represented as dashed elements and bracketed labels. To achieve a comprehensive characterization of different kinds of layout elements, document data is extracted with vision, natural language, and semantic technologies. The information is made accessible altogether through conversational agents based on language models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Document segmentation module</head><p>To extract geometric information, a segmentation strategy aimed at identifying layout categories and their properties is presented. The PDF articles are converted to the TEI-XML format through the Grobid API<ref type="foot" target="#foot_1">2</ref> in order to estimate the PDF structure as an XML tree. The resulting output contains the recognized structures, which are: title, DOI, keywords, abstract, authors, author data, emails, tables, figures, captions, formulas, dates, sections/subsections, acknowledgments, bibliographic entries, and raw text blocks. Positional information includes the page number and is present for most classes. Some structures have a deeper characterization; for instance, author consolidation is performed through integration with the CrossRef API.</p><p>The output of Grobid processing is parsed with Beautiful Soup to extract the TEI tags and serialize the information into key-value pairs (Figure <ref type="figure" target="#fig_2">3</ref>). The coordinates of the semantic elements are then used to draw bounding boxes on the original document and to associate a label with each semantic region. The hierarchy of the document is also extracted. The recognized layout elements that come with geometric information are highlighted in the user interface through bounding boxes (Figure <ref type="figure" target="#fig_3">4</ref>), and the information that comes without spatial data is associated with them as linked pop-ups.</p><p>The features in development that are related to the document segmentation module are represented in Figure <ref type="figure" target="#fig_1">2</ref> as dashed elements and bracketed labels.</p></div>
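As an illustration of this serialization step, the sketch below parses Grobid's TEI-XML with Beautiful Soup and turns coordinate-bearing elements into key-value pairs. The tag and attribute names ("figure", "formula", "head", "coords") follow the TEI vocabulary emitted by Grobid, while the function names and the pair layout are our own simplifications, not the system's actual code.

    # Illustrative sketch: from Grobid TEI-XML to key-value pairs.
    # Requires beautifulsoup4 and lxml (for the "xml" parser).
    from bs4 import BeautifulSoup

    def parse_coords(coords):
        # A Grobid "coords" attribute is "page,x,y,w,h" boxes separated by ";".
        boxes = []
        for box in coords.split(";"):
            page, x, y, w, h = box.split(",")
            boxes.append({"page": int(page), "x": float(x), "y": float(y),
                          "width": float(w), "height": float(h)})
        return boxes

    def serialize_tei(tei_path):
        with open(tei_path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "xml")
        pairs = []
        for tag_name, label in [("figure", "Figure"), ("formula", "Formula"),
                                ("head", "Section")]:
            for el in soup.find_all(tag_name):
                pairs.append({"label": label,
                              "text": el.get_text(" ", strip=True),
                              "boxes": parse_coords(el["coords"])
                                       if el.has_attr("coords") else []})
        return pairs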
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Semantic linking module</head><p>The semantic characterization assigned to the extracted information is derived from the Document Components Ontology <ref type="bibr" target="#b10">[11]</ref>, a specialized ontology designed for modeling the layout elements of scholarly and research documents. DoCO classes are integrated in our system, enabling a structured and standardized representation of document components. To ensure compatibility, a detailed mapping process is performed between the Grobid XML tags used for document parsing and the corresponding DoCO classes. This mapping is carried out by aligning the typical organization and content structure of a research article, ensuring semantic coherence and consistency across the extracted data. As detailed in Section 5, the semantic characterization of layout elements is leveraged to enhance visual recognition, exploiting the relations defined among the ontology classes to detect unfounded overlaps.</p><p>The features in development that are related to the semantic linking module are represented in Figure <ref type="figure" target="#fig_1">2</ref> as dashed elements and bracketed labels.</p></div>
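In code, this mapping reduces to a lookup table from TEI tag names to DoCO classes. The paper does not spell out the exact correspondence, so the entries below are plausible assumptions for illustration only.

    # Hypothetical excerpt of the Grobid-TEI to DoCO mapping; the real table
    # used by the system may differ.
    TEI_TO_DOCO = {
        "abstract": "doco:Abstract",
        "figure":   "doco:Figure",
        "table":    "doco:Table",
        "formula":  "doco:Formula",
        "head":     "doco:SectionTitle",
        "p":        "doco:Paragraph",
        "note":     "doco:Footnote",
        "listBibl": "doco:BibliographicReferenceList",
    }

    def doco_class(tei_tag):
        # Fall back to the generic text chunk class for unmapped tags.
        return TEI_TO_DOCO.get(tei_tag, "doco:TextChunk")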
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Question answering module</head><p>The question answering module is implemented by exploiting LLMs, specifically the Llama3 model through the Ollama Python API. This model has been chosen because of its ease of installation and integration with custom scripts and external resources. The question answering module is designed to enable the LLM to access the serialized output of Grobid, which is stored in a shared memory (Figure <ref type="figure" target="#fig_2">3</ref>). The results obtained with Llama3 are excellent, ensuring an adequate understanding of the questions given the provided resources, as well as of their possible absence. The response is fairly fast, even though the local installation does not have access to significant computational resources. The user question is augmented and proposed to the LLM in the form: "Given that: log_data, question", where log_data represents the output of the modules described in Sections 3.1 and 3.2 and question is the query input by the user through the user interface. The system context is given to the LLM as: "The questions will be about a scholarly article from which some data has been extracted in structured form and given as context."</p><p>The context length for an off-the-shelf language model such as Llama3 is set to 2048 tokens, which is restrictive both for the output of the Grobid and DoCO modules and for the extracted key-value pairs. To address the LLM context length issues and to limit the context to the part that is most pertinent to the question, the user interface described in Section 4 allows the user to provide a classification of the questions by choosing any number of labels among: Article_title, Author, Abstract, Caption, Caption_Figure, Figure, Table, Formula, Section, Link, Note, Acknowledgments, and Reference. These classes correspond to Grobid-extracted TEI-XML tags, which are mapped to the DoCO ontology entities. The labels provided by the user are exploited to split the context to be given to the LLM, retaining only the portions that constitute the object of the classification.</p><p>The features in development that are related to the LLM module are represented in Figure <ref type="figure" target="#fig_1">2</ref> as dashed elements and bracketed labels.</p></div>
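A minimal sketch of this augmentation step, assuming the Ollama Python client and the key-value pairs produced by the previous modules; the function and variable names are illustrative, not the system's actual code.

    # Sketch of prompt augmentation with label-based context filtering.
    import ollama

    SYSTEM_CONTEXT = ("The questions will be about a scholarly article from "
                      "which some data has been extracted in structured form "
                      "and given as context.")

    def answer(question, log_data, labels=None):
        # Keep only the pairs matching the user-selected labels
        # (e.g. ["Author", "Abstract"]) to stay within the context window.
        if labels:
            log_data = [p for p in log_data if p["label"] in labels]
        prompt = f"Given that: {log_data}, {question}"
        reply = ollama.chat(model="llama3",
                            messages=[{"role": "system", "content": SYSTEM_CONTEXT},
                                      {"role": "user", "content": prompt}])
        return reply["message"]["content"]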
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">User interface</head><p>The user interface (Figure <ref type="figure" target="#fig_3">4</ref>) is composed of five web pages interacting with the Python server that routes the user choices. Through this interface the user is able to upload a PDF document (Upload page) and to process it (PDF processor), extracting the information which is then exploited by the LLM. Once the PDF has been processed, the output shown to the user is the document augmented with bounding boxes spanning the classes described in Section 3.1. The drawing of bounding boxes is managed by generating a separate PDF layer for each class, assigning distinct colors to each layer, and then overlaying these layers onto the original input PDF to visually represent the annotations. The whole process completes in a variable amount of time, mainly depending on the input PDF length and on network capabilities, since Grobid is used as a network service. Processing a 10- to 20-page PDF takes a few seconds, while longer papers may require more than 10 seconds. The serialized information is used as context for the LLM, which is included in the interface to facilitate user interaction and the exploration of the system's functionalities. The LLM computation time for each question varies based on local GPU capabilities, generally taking a few seconds. Since LLMs have context length issues, the context of the question is chosen by the user by filtering on the question topic, which can be any number of layout classes, as detailed in Section 3.3. The interface includes a graphical preview of the data, consisting of PDF images with bounding boxes overlaying the layout elements associated with coordinates in the Grobid output, each provided with a layout element label. In addition, informative pop-ups containing all the data retrieved by the processing are present. The user interface also includes a specific perspective (Overlap violations) that is designed to outline the semantic integration described in Section 5. The purpose of this view is to highlight the layout elements whose overlaps violate the imported ontology constraints.</p></div>
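The overlay step can be sketched with PyMuPDF as follows; the actual interface builds a separate PDF layer per class, which this sketch simplifies to drawing colored rectangles directly on the pages. The color table and function names are illustrative, and the key-value pairs are those produced by the Section 3.1 sketch.

    # Sketch of the bounding-box overlay on the original PDF.
    import fitz  # PyMuPDF

    CLASS_COLORS = {"Figure": (1, 0, 0), "Formula": (0, 0.6, 0),
                    "Section": (0, 0, 1)}

    def annotate_pdf(src_path, dst_path, pairs):
        doc = fitz.open(src_path)
        for pair in pairs:
            color = CLASS_COLORS.get(pair["label"], (0.5, 0.5, 0.5))
            for box in pair["boxes"]:
                page = doc[box["page"] - 1]  # Grobid page numbers are 1-based
                rect = fitz.Rect(box["x"], box["y"],
                                 box["x"] + box["width"],
                                 box["y"] + box["height"])
                page.draw_rect(rect, color=color, width=0.8)
        doc.save(dst_path)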
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Constraint violations analysis</head><p>To understand how the relations defined in an ontology can be applied to visually extracted classes and how they can improve the classification of layout elements, we analyze the interactions among the DoCO ontology classes<ref type="foot" target="#foot_2">3</ref>. The idea is to exploit the ontology relations that occur between layout elements to check the admissibility of geometric overlaps. Since the assertions defined in the ontology can involve objects lacking spatial characterization in the Grobid-extracted counterpart, it is essential to identify the relations among objects with coordinates. Then, a geometric notion for each relation between class instances in the format (page, x, y, width, height) has to be defined. Afterwards, ontology constraints can be applied to detect visual recognition errors like overlap errors (Figure <ref type="figure" target="#fig_4">5</ref>).</p><p>We distinguish overlaps into geometric overlaps and semantic overlaps, since the former can be admissible while the latter are most likely recognition errors, and we need to determine whether a given overlap is admissible. To formally determine which two-dimensional elements overlap in a context where we have coordinates defining their position on a page, we treat the elements as rectangles defined by the following properties:</p><p>• Page: the page number (if two elements are on different pages, they cannot overlap) • x, y: the coordinates of the top-left corner of the rectangle • width, height: the width and height of the rectangle</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Overlap Criterion</head><p>Two elements overlap if and only if their rectangles intersect in two-dimensional space. Formally, rectangle 𝐴 is defined by the coordinates of its top-left corner (𝑥 𝐴 , 𝑦 𝐴 ) together with width 𝐴 and height 𝐴 , and rectangle 𝐵 by (𝑥 𝐵 , 𝑦 𝐵 ) together with width 𝐵 and height 𝐵 . To check for overlap, we verify that there is no separation between the two rectangles in either the horizontal or the vertical dimension.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conditions for Non-Overlap</head><p>1. The rectangles do not overlap if one is entirely to the right of the other: 𝑥 𝐴 + width 𝐴 ≤ 𝑥 𝐵 or 𝑥 𝐵 + width 𝐵 ≤ 𝑥 𝐴 . 2. The rectangles do not overlap if one is entirely below the other: 𝑦 𝐴 + height 𝐴 ≤ 𝑦 𝐵 or 𝑦 𝐵 + height 𝐵 ≤ 𝑦 𝐴 .</p><p>Then, two rectangles 𝐴 and 𝐵 overlap if none of the above conditions is true. Formally, they overlap if all the following conditions hold:</p><formula xml:id="formula_1">𝑥 𝐴 &lt; 𝑥 𝐵 + width 𝐵 , 𝑥 𝐵 &lt; 𝑥 𝐴 + width 𝐴 𝑦 𝐴 &lt; 𝑦 𝐵 + height 𝐵 , 𝑦 𝐵 &lt; 𝑦 𝐴 + height 𝐴</formula><p>To identify the overlapping errors we focus on the rectangles which overlap and classify each overlap as admissible or not admissible, marking as not admissible any overlap generated by classes which are disjoint in the ontology general axioms<ref type="foot" target="#foot_3">4</ref> and that are reported in Table <ref type="table" target="#tab_6">1</ref>. The Overlap violations section of the user interface described in Section 4 highlights the bounding boxes which are associated to classes that are not allowed to overlap (Figure <ref type="figure" target="#fig_6">6</ref>).</p></div>
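Both checks fit in a few lines. The sketch below encodes the overlap test above and an admissibility check against the disjointness sets of Table 1 (abridged here); the function names are of our own choosing.

    # Overlap test plus ontology-based admissibility check.
    DISJOINT_SETS = [  # abridged from the DoCO general axioms in Table 1
        {"back matter", "body matter", "captioned box", "chapter", "footnote",
         "formula", "front matter", "list", "part", "section", "table"},
        {"label", "paragraph", "subtitle", "title"},
        {"sentence", "simple run-in quotation", "text chunk"},
    ]

    def rectangles_overlap(a, b):
        # a, b: dicts with page, x, y, width, height (top-left origin).
        if a["page"] != b["page"]:  # different pages never overlap
            return False
        return (a["x"] < b["x"] + b["width"] and b["x"] < a["x"] + a["width"]
                and a["y"] < b["y"] + b["height"]
                and b["y"] < a["y"] + a["height"])

    def overlap_is_admissible(class_a, class_b):
        # An overlap is a violation when the two classes are declared disjoint.
        return not any(class_a in s and class_b in s and class_a != class_b
                       for s in DISJOINT_SETS)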
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Future work</head><p>Future directions involve leveraging LaTeX source attributes, the analysis of more relations and ontologies, and a broader employment of the LLM for context length optimization. Currently, the system allows the exploration of visual data extracted through the segmentation module. More modules can be linked to extend the knowledge associated with explorable elements, such as the LaTeX representation of the document <ref type="bibr" target="#b26">[27]</ref>. Associating different representations of the document elements would also enable the automatic construction of class-specific datasets, e.g. formula and chemical structure datasets. Moreover, the custom-made user interface described in Section 4 allows code instrumentation, enabling comparison with state-of-the-art systems with similar purposes such as ChatDOC and Amazon Textract.</p><p>The employment of ontology relations can be extended to identify more than overlap errors. To this end, parent relations can be exploited to detect misclassifications and elements missing paired classes (e.g. figures and captions). In addition, the presence of the ontology layer makes it possible to extend the present structure with broader context ontologies <ref type="bibr" target="#b27">[28]</ref> and to exploit reasoning capabilities to expand the current relations with the inferable ones.</p><p>The main limitation of the LLM module lies in its reliance on the user's classification of the query, aimed at reducing the context length. The same result is achievable through unsupervised classification of the user query, which can be delegated to a dedicated LLM module. It should also be noticed that the LLM performance would increase by employing language models with more parameters.</p><p>Current objectives include an assessment on the DocBank dataset, which contains numerous layout classes and overlaps such as caption over figure, list over equation, and section over author, among others; an excerpt is presented in Figure <ref type="figure" target="#fig_7">7</ref>. It should be noticed that the present analysis on the DocBank dataset does not take semantic characterization into account, thus including some overlapping layout elements that are admissible (e.g. equation over list).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>This paper extends the research fields of scholarly document understanding and document question answering by pursuing the association of semantics and context with visual features, integrating them in a comprehensive interface which allows multi-layer exploration via LLM and interactive visualization. Linking semantic information to documents is challenging from a research perspective: most of the solutions reviewed in the state of the art exhibit limited awareness of the described domain, considering only basic relations between text chunks. We leverage the Document Components Ontology, focusing on the semantic relations among layout elements to detect a specific kind of visual recognition error, namely overlap errors, paving the way for more sophisticated analyses of layout element interactions. The proposed approach is based on the disjointness relations that may exist between overlapping layout elements. These relations are analyzed and interpreted as indicators of potential recognition errors, providing a systematic way to identify and address inconsistencies in the detected layout structure. By exploiting this property, our method improves the accuracy and reliability of the recognition process. In addition, we exploit LLMs in our framework to enhance the accessibility of diverse information which is not directly available from the data, enabling navigation of different kinds of information from an integrated interface.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The SPAR ontologies modules [9].</figDesc><graphic coords="3,72.00,517.75,454.36,66.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The architecture of the system. The features in development are represented as dashed elements and bracketed labels.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: An example of the flow from PDF data to the structured information which is given as context to the LLM.</figDesc><graphic coords="6,255.97,340.92,264.95,82.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: User interface with bounding boxes, element data popups (displaying Acknowledgements data) and LLM dialogue (best viewed in color).</figDesc><graphic coords="7,75.01,172.39,445.25,223.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Example of overlapping bounding boxes corresponding to disjoint classes (best viewed in color)</figDesc><graphic coords="8,105.62,65.61,384.04,192.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>1 .</head><label>1</label><figDesc>Rectangle A: a) 𝑥 𝐴 , 𝑦 𝐴 (coordinates of the top-left corner) b) width 𝐴 , height 𝐴 2. Rectangle B: a) 𝑥 𝐵 , 𝑦 𝐵 (coordinates of the top-left corner) b) width 𝐵 , height 𝐵</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Detection of not-allowed overlaps based on semantic constraint analysis (best viewed in color).</figDesc><graphic coords="9,75.97,65.61,443.33,171.81" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Top 10 overlap kinds on the Docbank dataset.</figDesc><graphic coords="10,93.64,65.60,408.00,189.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>317 .69 ,223 .29 ,240 .52 ,7.94 ; 8,317 .96 ,234 .24 ,240 .42 ,7.94 ; 8,317 .96 ,245 .20 ,149 .63 ,</head><label></label><figDesc></figDesc><table /><note>type =" acknowledgement "&gt; &lt; div &gt; &lt; head coords =" 8,317 .96 ,208 .59 ,91 .99 ,9.37 "&gt;Acknowledgment &lt; p coords =" 8,&lt; s coords =" 8,</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>317 .69 ,223 .29 ,240 .52 ,7.94 ; 8 ,317 .96 ,234 .24 ,99 .42 ,7.94 "&gt; The views expressed here are those of the authors alone not of &lt;rs</head><label></label><figDesc></figDesc><table /><note>type =" institution "&gt;BlackRock , Inc or NVID &lt; /s&gt; &lt; s coords =" 8,</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>419 .62 ,234 .24 ,138 .76 ,7.94 ; 8 ,317 .96 ,245 .20 ,149 .63 ,7.94 "&gt; We are grateful to &lt;rs</head><label></label><figDesc></figDesc><table /><note>type =" person "&gt;Emma Lind &lt;/rs</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>invaluable support for this collaboration .</head><label></label><figDesc></figDesc><table><row><cell>&lt;</cell><cell>/s&gt;</cell></row><row><cell>&lt;</cell><cell>/p&gt;</cell></row><row><cell cols="2">&lt; /div &gt;</cell></row><row><cell cols="2">&lt;/div &gt;</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>PDF excerpt TEI-XML Key-value pairs LLM user interface</head><label></label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head></head><label></label><figDesc>-Overlap 1. The rectangles do not overlap if one is entirely to the right of the other: 𝑥 𝐴 + width 𝐴 ≤ 𝑥 𝐵 or 𝑥 𝐵 + width 𝐵 ≤ 𝑥 𝐴 2. The rectangles do not overlap if one is entirely below the other: 𝑦 𝐴 + height 𝐴 ≤ 𝑦 𝐵 or 𝑦 𝐵 + height 𝐵 ≤ 𝑦 𝐴</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 1</head><label>1</label><figDesc>General axioms of the DoCO ontology; the layout elements in each row are disjoint from each other. The classes generating the not admissible overlaps in Figure5are underlined.</figDesc><table><row><cell>All Disjoint Classes</cell></row><row><cell>back matter, body matter, captioned box, chapter, complex run-in quotation, footnote, formula, formula box,</cell></row><row><cell>front matter, list, part, section, table</cell></row><row><cell>abstract, afterword, appendix, colophon, foreword, glossary, index, list of figures, list of tables, preface,</cell></row><row><cell>table of contents</cell></row><row><cell>label, paragraph, subtitle, title</cell></row><row><cell>list of authors, list of contributors, list of organizations</cell></row><row><cell>sentence, simple run-in quotation, text chunk</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://chatdoc.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://kermitt2-grobid.hf.space/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://sparontologies.github.io/doco/current/doco.html#d4e145</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://sparontologies.github.io/doco/current/doco.html#generalaxioms</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">CAI4DSA ID:EP_FAIR_001 CUP:B13C23005640006</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements page : 8 Acknowledgements person</head><p>: Emma Lind Acknowledgements text : The views expressed here are those of the authors alone and not of BlackRock , Inc or NVIDIA . We are grateful to Emma Lind for her invaluable support for this collaboration .</p><p>Acknowledgements coordinates : ( page : 8, x: 317 .69 , y: 223 .29 , w: 240 .52 , h: 7.94 ) Acknowledgements coordinates : ( page : 8, x: 317 .96 , y: 234 .24 , w: 240 .42 , h: 7.94 ) Acknowledgements coordinates : ( page : 8, x: 317 .96 , y: 245 .2, w: 149 .63 , h: 7.94 ) Acknowledgements coordinates : ( page : 8, x: 317 .69 , y: 223 .29 , w: 240 .52 , h: 7.94 ) Acknowledgements coordinates : ( page : 8, x: 317 .96 , y: 234 .24 , w: 99 .42 , h: 7.94 ) Acknowledgements coordinates : ( page : 8, x: 419 .62 , y: 234 .24 , w: 138 .76 , h: 7.94 ) Acknowledgements coordinates : ( page : 8, x: 317 .96 , y: 245 .2, w: 149 .63 , h: 7.94 )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research has been partially funded by CAI4DSA 5 actions (Collaborative Explainable neuro-symbolic AI for Decision Support Assistant), of the FAIR national project on artificial intelligence, PE 1 PNRR (https://fondazione-fair.it/).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Graph neural networks and representation embedding for table extraction in pdf documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gemelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Vivoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Marinai</surname></persName>
		</author>
		<idno type="DOI">10.1109/icpr56361.2022.9956590</idno>
		<ptr target="http://dx.doi.org/10.1109/ICPR56361.2022.9956590.doi:10.1109/icpr56361.2022.9956590" />
	</analytic>
	<monogr>
		<title level="m">2022 26th International Conference on Pattern Recognition (ICPR)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A robust framework for mathematical formula detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Do</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">D</forename><surname>Ngo</surname></persName>
		</author>
		<idno type="DOI">10.1109/MAPR53640.2021.9585197</idno>
		<ptr target="https://doi.org/10.1109/MAPR53640.2021.9585197.doi:10.1109/MAPR53640.2021.9585197" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Multimedia Analysis and Pattern Recognition, MAPR 2021</title>
				<meeting><address><addrLine>Hanoi, Vietnam</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">October 15-16, 2021. 2021</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Lopez</surname></persName>
		</author>
		<idno type="swh">swh:1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c</idno>
		<ptr target="https://github.com/kermitt2/grobid" />
		<title level="m">Grobid</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Hrdoc: Dataset and baseline method toward hierarchical reconstruction of document structures</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1609/AAAI.V37I2.25277</idno>
	</analytic>
	<monogr>
		<title level="m">Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Williams</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Neville</surname></persName>
		</editor>
		<meeting><address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2023">February 7-14, 2023. 2023</date>
			<biblScope unit="page" from="1870" to="1877" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Tkaczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Collins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sheridan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Beel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.01168</idno>
		<title level="m">Evaluation and comparison of open source bibliographic reference parsers: A business use case</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Semantic parsing of interpage relations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Demirtaş</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasin Akpınar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Deniz</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICPR56361.2022.9956546</idno>
	</analytic>
	<monogr>
		<title level="m">2022 26th International Conference on Pattern Recognition (ICPR)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1579" to="1585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Publaynet: Largest dataset ever for document layout analysis</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jimeno-Yepes</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDAR.2019.00166</idno>
		<ptr target="https://doi.org/10.1109/ICDAR.2019.00166.doi:10.1109/ICDAR.2019.00166" />
	</analytic>
	<monogr>
		<title level="m">2019 International Conference on Document Analysis and Recognition, ICDAR 2019</title>
				<meeting><address><addrLine>Sydney, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">September 20-25, 2019. 2019</date>
			<biblScope unit="page" from="1015" to="1022" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.01038</idno>
		<title level="m">Docbank: A benchmark dataset for document layout analysis</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The spar ontologies</title>
		<author>
			<persName><forename type="first">S</forename><surname>Peroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Shotton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web-ISWC 2018: 17th International Semantic Web Conference</title>
				<meeting><address><addrLine>Monterey, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">October 8-12, 2018. 2018</date>
			<biblScope unit="page" from="119" to="136" />
		</imprint>
	</monogr>
	<note>Proceedings, Part II 17</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A programming interface for creating data according to the spar ontologies and the opencitations data model</title>
		<author>
			<persName><forename type="first">S</forename><surname>Persiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Daquino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Peroni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="305" to="322" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The document components ontology (doco)</title>
		<author>
			<persName><forename type="first">A</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Peroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pettifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Shotton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Vitali</surname></persName>
		</author>
		<idno type="DOI">10.3233/SW-150177</idno>
		<ptr target="https://doi.org/10.3233/SW-150177.doi:10.3233/SW-150177" />
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="167" to="181" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Fabio and cito: ontologies for describing bibliographic resources and citations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Peroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Shotton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="33" to="43" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Natural language query formalization to sparql for querying knowledge bases using rasa</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Swathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Akshay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Progress in Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="193" to="206" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Paval: A location-aware virtual personal assistant for retrieving geolocated points of interest and location-based services</title>
		<author>
			<persName><forename type="first">L</forename><surname>Massai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pantaleo</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.engappai.2018.09.013</idno>
		<ptr target="https://doi.org/10.1016/j.engappai.2018.09.013" />
	</analytic>
	<monogr>
		<title level="j">Engineering Applications of Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="page" from="70" to="85" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.11116</idno>
		<title level="m">Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.07437</idno>
		<title level="m">Evaluation of retrieval-augmented generation: A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Toolformer: Language models can teach themselves to use tools</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dwivedi-Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dessì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Raileanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lomeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Cancedda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.11761</idno>
		<title level="m">Knowledgpt: Enhancing large language models with retrieval and storage access on knowledge bases</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Lála</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>O'donoghue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shtedritski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Rodriques</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>White</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.07559</idno>
		<title level="m">Paperqa: Retrievalaugmented generative agent for scientific research</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Layoutlmv3: Pre-training for document AI with unified text and image masking</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="DOI">10.1145/3503161.3548112</idno>
		<ptr target="https://doi.org/10.1145/3503161.3548112" />
	</analytic>
	<monogr>
		<title level="m">MM &apos;22: The 30th ACM International Conference on Multimedia</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Magalhães</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Bimbo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Satoh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Sebe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Alameda-Pineda</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Q</forename><surname>Jin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Oria</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Toni</surname></persName>
		</editor>
		<meeting><address><addrLine>Lisboa, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">October 10 -14, 2022. 2022</date>
			<biblScope unit="page" from="4083" to="4091" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">On leveraging multi-page element relations in visuallyrich documents</title>
		<author>
			<persName><forename type="first">D</forename><surname>Napolitano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vaiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cagliero</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2024">2024. 2024</date>
			<biblScope unit="page" from="360" to="365" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Hierarchical multimodal transformers for multipage docvqa</title>
		<author>
			<persName><forename type="first">R</forename><surname>Tito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karatzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Valveny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">144</biblScope>
			<biblScope unit="page">109834</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Gram: Global reasoning for multi-page vqa</title>
		<author>
			<persName><forename type="first">T</forename><surname>Blau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fogel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ronen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Golts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ganz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ben Avraham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aberdam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tsiper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Litman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="15598" to="15607" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Document understanding dataset and evaluation (dude)</title>
		<author>
			<persName><forename type="first">J</forename><surname>Van Landeghem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Borchmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pietruszka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Joziak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Powalski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Anckaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Valveny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="19528" to="19540" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Automatic generation of scientific papers for data augmentation in document layout analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Pisaneschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gemelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Marinai</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.patrec.2023.01.018</idno>
		<ptr target="https://doi.org/10.1016/j.patrec.2023.01.018" />
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">167</biblScope>
			<biblScope unit="page" from="38" to="44" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">S2ORC: The semantic scholar open research corpus</title>
		<author>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weld</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.447</idno>
		<ptr target="https://www.aclweb.org/anthology/2020.acl-main.447.doi:10.18653/v1/2020.acl-main.447" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4969" to="4983" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">An html/css schema for tex primitives-generating high-quality responsive</title>
		<author>
			<persName><forename type="first">D</forename><surname>Müller</surname></persName>
		</author>
		<ptr target="https://kwarc.info/people/dmueller/pubs/tug23.pdf" />
	</analytic>
	<monogr>
		<title level="j">TUGboat</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="275" to="286" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Crossref: The sustainable source of community-owned scholarly metadata</title>
		<author>
			<persName><forename type="first">G</forename><surname>Hendricks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tkaczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Feeney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Quantitative Science Studies</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="414" to="427" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
