<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
<contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Contino</string-name>
          <email>salvatore.contino01@unipa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Siragusa</string-name>
          <email>irene.siragusa02@unipa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Sciortino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pirrone</string-name>
          <email>roberto.pirrone@unipa.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Engineering, University of Palermo</institution>,
          <addr-line>Palermo, 90128, Sicily</addr-line>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
<abstract>
        <p>In this work, we present PARSAL, a retrieval pipeline to obtain relevant scientific articles in a standardized format, given a set of relevant keywords. The pipeline exploits the APIs of scientific publishers to retrieve relevant full-text articles in PDF, JSON, or XML format. Afterwards, a parser was implemented to standardize the retrieved articles into a unique format, so that they can be inserted into a MongoDB database and accessed via a custom GUI. In addition, papers are arranged in a Knowledge Graph, built via the LlamaIndex framework, to allow users to query the collected articles and obtain a verbose answer. The code of the developed pipeline, GUI, and Knowledge Graph creation and inference is available on GitHub1.</p>
      </abstract>
      <kwd-group>
        <kwd>Bibliographic research</kwd>
        <kwd>OCR</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>
        The field of scientific research is constantly expanding and is going through a golden moment in the
digital era. In fact, scientific output across all disciplines is growing exponentially year after year [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as can be seen in Figure 1.
      </p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>
        Early projections suggest that this publishing trend will continue over the next few years,
increasing the number of peer-reviewed articles available to the scientific community, which will
end up with a growing number of articles to read and review. In fact, bibliographic analysis is an
important phase in scientific research and an essential preliminary step in any work. In emerging
fields, where new articles are published daily, it is desirable to keep track of continuous changes and
highlight trends to build a reliable systematic overview of a given research topic. In the Natural Language
Processing (NLP) field, the advent of Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], i.e. the
combination of document retrieval mechanisms and the generative capabilities of Large Language Models
(LLMs), has revolutionized the approach to generating verbose responses. In fact, this technique reduces
hallucinations while increasing the accuracy, reliability, and richness of the responses of the generative
model by providing context from a knowledge base consistent with the topics covered. A RAG-based
architecture can be helpful as a support tool in the context of a scientific literature analysis.
However, a large number of scientific publications is necessary to build the desired knowledge base
for the retrieval phase. In addition, given the ever-increasing number of papers published and the time
required to download and arrange them in a suitable textual format for further applications, in an age
where time is increasingly limited, it is necessary to rely on automatic tools for retrieval and formatting.
      </p>
<p>In this paper, we introduce PARSAL, an automatic open-source retrieval pipeline that can interact with the
APIs of the most popular publishing houses and, through an intuitive graphical interface, download the
most relevant scientific articles based on a keyword-search strategy. In addition, PARSAL automatically
formats documents, simplifying the complex pre-processing phase and making them suitable for building
RAG-based inference systems. Thus, the main contributions of this work can be summarized as follows:
1. An extraction pipeline across the APIs of different editors, given a target keyword;
2. A parser which processes retrieved articles in XML, JSON, or PDF format and returns them in a standardized
JSON format;
3. A Graphical User Interface (GUI) based on a MongoDB database, to interactively access the
retrieved articles;
4. An end-to-end solution for supporting the building of an effective RAG architecture.
This work is organized as follows. Section 2 reports an overview of the state-of-the-art approaches.
Section 3 reports the APIs available to download full-text articles, while the retrieval pipeline is presented
in Section 4 along with its GUI. To assess the potential of the proposed pipeline, a case study is
presented in the field of drug discovery, in which a Knowledge Graph (KG) has been created and queried
accordingly, as a retrieval source in a RAG setup (Section 5), and concluding remarks are drawn in
Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
<p>
        The availability of advanced tools has recently simplified automatic access to the full text of scientific
articles, responding to the need to manage an increasingly vast academic output. Over the years,
various open-source bots and scripts have been developed to meet this demand, capable of integrating
different sources. PyPaperBot1 is an example that combines searches in Google Scholar and CrossRef to
sequentially download PDFs from DOI lists. Similarly, many official platforms provide APIs to facilitate
automatic content retrieval, offering both metadata retrieval from a given DOI and full-text PDF retrieval with
services such as the CrossRef API [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or Unpaywall2. In addition, both large open archives (arXiv) and
publishers (Elsevier) offer API interfaces to systematically download scientific PDF papers. A brief
overview of some available APIs in this context is reported in Section 3.
      </p>
<p>
        At the same time, the scientific community has developed tools to support literature reviews based
on NLP techniques. A good example is LiteRev [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an advanced tool designed to speed up systematic
reviews by integrating machine learning algorithms into the bibliographic search process. LiteRev
automatically searches a wide range of open-access databases based on a user query, retrieving
metadata (including titles, abstracts, and sometimes full text) from potentially relevant articles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>1https://github.com/ferru97/PyPaperBot 2https://unpaywall.org/</title>
        <p>
The collected texts are then preprocessed and represented using text mining techniques (e.g., TF-IDF
vectorization), and then analyzed with dimensionality reduction and unsupervised clustering algorithms.
In this way, LiteRev organizes the corpus of documents into key topics, highlighting groups of works
with related keywords, and facilitates the identification of relevant research topics [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The user can
iteratively refine the search by selecting topics of interest, adding keywords, or manually indicating
some key articles. Based on these inputs, the system searches for similar documents (e.g. using k-NN)
and suggests new relevant articles, allowing for guided exploration of related literature [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
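<p>The TF-IDF representation and k-NN-style suggestion step described above can be illustrated with a self-contained sketch. This is a minimal toy implementation over hand-tokenized abstracts, not LiteRev's actual code:</p>
<preformat>
```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def suggest(query_idx, vectors, k=1):
    """Return indices of the k documents most similar to the query document."""
    sims = [(cosine(vectors[query_idx], v), i)
            for i, v in enumerate(vectors) if i != query_idx]
    return [i for _, i in sorted(sims, reverse=True)[:k]]

abstracts = [
    "graph neural networks for drug discovery".split(),
    "convolutional networks for molecular property prediction".split(),
    "satellite imagery for crop yield".split(),
]
vecs = tfidf_vectors(abstracts)
print(suggest(0, vecs, k=1))  # [1]
```
</preformat>
<p>Document 1 shares the discriminative term "networks" with the query, while "for" occurs everywhere and gets zero IDF weight, so it does not contribute.</p>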
        <p>
          Another recent development involves the usage of LLMs to query and synthesize scientific knowledge,
combining automatic document retrieval with LLM generative capabilities. PaperQA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a recent
system that adopts a RAG-based method to allow natural language interaction with a set of scientific
articles [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. It operates as an intelligent agent that, given a user query, initially searches for information
through the entire text of available articles, using document-scale information retrieval techniques [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Then it evaluates the relevance of the identified sources and text passages, and finally feeds a generative
model (LLM) with this relevant content to produce a reasoned response to the question [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Through
this mechanism, the model generates responses based on facts extracted from publications and can
provide precise references to the sources from which it draws information, increasing its transparency
and reliability. This highlights the potential of RAG systems to bridge the gap between the amount of
literature available and the human capacity to absorb and use it effectively to make informed decisions.
        </p>
        <p>
          Finally, several researchers are exploring the use of RAG to automate the review of the scientific
literature. For example, Han et al. provide an exhaustive review of the role of RAG in automating
literature reviews [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This study highlights that LLMs integrated with dynamic retrieval abilities can
automate several stages of the scientific review process, from document retrieval to results synthesis
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In particular, RAG models mitigate the typical limitations of “pure” LLMs, such as hallucination
of inaccurate content due to static knowledge limited to the training period. In contrast, RAG-based
models combine language generation with real-time access to up-to-date information from external
databases [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Han and colleagues point out that such a system can be incorporated into a complete
review workflow. In their proposal, an LLM with RAG sequentially supports bibliographic search,
title/abstract selection, extraction of relevant data from full-text articles, and finally synthesis of results,
thus covering the four key phases of a systematic review [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. These integrated technologies suggest that
current systems and tools are no longer limited to downloading PDFs but are evolving into intelligent
systems and tools that can efficiently search, categorize, and summarize scientific literature, allowing
users to access not only documents, but also insights extracted from them in a readily usable format.
        </p>
<p>Our proposed pipeline, following an open-source philosophy in contrast with tools such as ScopusAI3,
permits customized keyword-based article retrieval from multiple editors’ APIs. The flexibility of the
framework allows us to arrange the collected articles, provided in structured formats (XML, JSON) or as PDFs, into
a standard format, thus allowing the building of both a custom retrieval and a custom generative component
that better meets the user’s needs. To stress this point, we propose a case study in the field of drug
discovery, where the retrieval component consists of a KG. Given a user query, the output obtained
may consist of both a verbose answer and a list of relevant sources.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. API overview</title>
<p>A core point of the proposed pipeline is to obtain full-text articles in an automatic manner. As
a consequence, a study was conducted on the APIs available from the most popular and relevant
publishers for the scientific community. The idea is to involve APIs that allow the search of relevant
articles given a target keyword. More in detail, the available APIs can be mapped as:
• Metadata APIs, which provide only bibliographic information;
• Full-text APIs, which allow full access to the textual content of the requested article. Access can be
restricted to open-access publications or include non-open-access articles with prior
authentication.
3https://www.elsevier.com/products/scopus/scopus-ai</p>
<p>A complete overview of the selected publishers and the characteristics of the available APIs is reported in the
following subsections, while their main characteristics are summarized in Table 1. Although MDPI offers a
valuable API endpoint, it has not been included in this analysis, as its automatic usage for downloading
full-text articles is not permitted.</p>
      <sec id="sec-3-1">
        <title>3.1. API @ ACL Anthology</title>
<p>ACL Anthology hosts open-access articles related to computational linguistics and NLP. Articles can be
accessed via its Python library4, which allows interaction with the entire article repository both in
streaming access and fully locally. Since no custom search or ranking algorithm is implemented, articles have
to be searched programmatically, and custom filters may be implemented over the articles’
metadata, such as keywords, title, or publishing year.</p>
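<p>Such a client-side metadata filter can be sketched as follows. The dictionary fields used here are illustrative placeholders, not the actual schema of the acl-anthology library:</p>
<preformat>
```python
def filter_articles(articles, keyword=None, min_year=None):
    """Client-side filter over article metadata records.
    Each record is a dict with illustrative 'title' and 'year' fields."""
    hits = []
    for art in articles:
        if min_year is not None and min_year > art["year"]:
            continue
        if keyword is not None and keyword.lower() not in art["title"].lower():
            continue
        hits.append(art)
    return hits

catalog = [
    {"id": "2023.acl-long.1", "title": "Retrieval-Augmented Generation for QA", "year": 2023},
    {"id": "2019.acl-long.7", "title": "Neural Machine Translation", "year": 2019},
]
print(len(filter_articles(catalog, keyword="retrieval", min_year=2020)))  # 1
```
</preformat>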
<p>Once the articles’ IDs are collected, full-text PDFs can be downloaded without any authentication, as
all articles in the anthology are open-access.</p>
      </sec>
      <sec id="sec-3-1b">
        <title>3.2. API @ arXiv</title>
        <p>arXiv is a free distribution service and an open-access archive of non-peer-reviewed pre-print articles;
it mainly serves as a platform for sharing ongoing research and for articles
waiting to be published in journals or presented at conferences.</p>
<p>
          Its native open repository [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ] can be used to access full-text articles via its API5. A parametric
search allows retrieving the PDF or XML of the full-text articles and their associated metadata through
OAI-PMH [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Neither authentication nor an API key is required, but limitations apply in terms of the
rate limit on the number of articles that can be downloaded simultaneously.
        </p>
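<p>A keyword search against the arXiv API reduces to building a query URL; the sketch below only constructs the URL, without performing the network call (endpoint and parameters as documented in the public arXiv API manual):</p>
<preformat>
```python
from urllib.parse import urlencode

def arxiv_query_url(keyword, start=0, max_results=10):
    """Build an arXiv API query URL for a keyword search across all fields."""
    params = {
        "search_query": f'all:"{keyword}"',
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)

url = arxiv_query_url("drug discovery", max_results=5)
print(url)
```
</preformat>
<p>The response is an Atom XML feed whose entries carry links to the PDF of each matching pre-print.</p>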
      </sec>
      <sec id="sec-3-2">
        <title>3.3. API @ Elsevier</title>
        <sec id="sec-3-2-1">
          <title>4https://aclanthology.org/info/development/ 5https://arxiv.org/help/api/ 6https://dev.elsevier.com/</title>
<p>In the context of this work, both APIs have been used, and retrieval was limited to ScienceDirect
open-access articles. During this process, some limitations were encountered, such as the maximum
number of results per query, the maximum quota available per user, and the rate limit per request.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. API @ Springer Nature</title>
<p>Springer Nature is a German-British academic publishing company created by the merger of Springer
Science+Business Media and the Nature Publishing Group. Published articles are either open-access or accessible
under payment; the APIs allow full-text download of open-access articles, while institutional API
access is required for downloading non-open-access articles.</p>
<p>
          The Springer Metadata API and Springer Meta API allow retrieval of metadata about all articles
in the catalog, while the Springer Open Access API allows both retrieval of metadata and download of
full-text content in JSON, PDF, and JATS XML format [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for open-access articles only. Its search
engine selects the best articles based on the input query. The articles and related metadata found are
sorted by relevance, based on BM25F [11], an optimized version of BM25 for structured documents, or on
other criteria, and are output in JSON or XML format. Open-access articles can then be identified and
downloaded in their full-text version in PDF, HTML, or XML format.
        </p>
      </sec>
      <sec id="sec-3-4b">
        <title>3.5. API @ Wiley</title>
        <p>
          Wiley is an American publishing company that focuses on academic publishing; its Wiley TDM
API allows full-text retrieval of requested articles in PDF, XML, or JSON format.
        </p>
<p>
          Due to the lack of an API endpoint to retrieve identifiers of relevant articles given a target keyword or
other criteria, the CrossRef API7 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] can be used, with an additional filter to select articles whose
DOI belongs to Wiley. After retrieving the metadata of the relevant articles, the full-text articles
are downloaded through the Wiley TDM API, which requires the target DOI and a TDM authentication
token [12]. Rate limits are applied to the amount of downloadable content.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Retrieval pipeline</title>
<p>In this Section, the proposed retrieval pipeline (Figure 2) is described, which is composed of three
different modules, namely the API module, the OCR module, and the Parser module. The API module
employs the publicly accessible APIs described in Section 3 for the automatic retrieval of relevant articles
given a keyword-based query. The articles obtained in XML, JSON, or PDF format are then processed
to obtain a unique and standardized format. The OCR module extracts the textual content
from PDFs leveraging an Optical Character Recognition (OCR) model, while the Parser module creates a
unique standard JSON-based format from either XML or JSON files, as well as from the output of the
OCR module.</p>
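<p>The routing between the three modules can be summarized as a simple dispatch on the file format. This is a schematic sketch with placeholder functions standing in for the actual modules:</p>
<preformat>
```python
from pathlib import Path

def ocr_module(path):          # placeholder: would run olmOCR on the PDF
    return {"source": path.name, "text": "[ocr text]"}

def parser_module(payload):    # placeholder: would map to the standard JSON schema
    return {"standardized": True, "payload": payload}

def process_article(path):
    """Route a retrieved article through OCR and/or the parser by extension."""
    path = Path(path)
    if path.suffix.lower() == ".pdf":
        return parser_module(ocr_module(path))   # PDF: OCR first, then parse
    elif path.suffix.lower() in {".xml", ".json"}:
        return parser_module(path.name)          # structured file: parse directly
    raise ValueError(f"unsupported format: {path.suffix}")

print(process_article("article.pdf")["standardized"])  # True
```
</preformat>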
      <sec id="sec-4-1">
        <title>4.1. OCR module</title>
<p>To properly process articles in PDF format, extraction of their textual content is needed. Although
some metadata (e.g. authors, title, abstract) can also be obtained from the PDF, they can
be more easily accessed via Metadata API queries. In addition, metadata retrieved from the APIs come with a
natively well-formatted structure. Therefore, the focus of the OCR module is limited
to the extraction of the actual textual content of the article, i.e. the section titles and their
corresponding content.</p>
<p>As the core of the OCR module, the Open Language Model OCR (olmOCR) [13] was used. olmOCR is
an open-source toolkit developed by the Allen Institute for AI (AI2), based on Qwen2-VL-7B-Instruct [14], a
7B-parameter Visual Language Model (VLM). It has been fine-tuned over the olmOCR-mix-0225 dataset [15],
which consists of 260,000 pages from various PDF files, including scientific papers, books, and
legal documents [13].</p>
        <sec id="sec-4-1-1">
          <title>7https://www.crossref.org/documentation/retrieve-metadata/rest-api/</title>
<p>Other OCR models have been considered, such as Tesseract [16], EasyOCR [17], PaddleOCR [18],
and docTR [19], but their performance was poor compared to olmOCR, both in terms of accuracy and of the
computational resources needed (GPU and inference time). In addition, the aforementioned models
needed an additional external model for layout analysis, such as LayoutParser [20], to correctly process
the complex layout of scientific articles. olmOCR, on the other hand, has an embedded component for
layout analysis of the given document, thus allowing the automatic recognition of images, multi-column
content, tables, and equations.</p>
          <p>To extract the textual content of a given PDF file, olmOCR converts each page into a high-resolution
image over which a layout analysis is performed and then the textual content is extracted and arranged
in textual (txt) and markdown (md) format to preserve the structure of the provided document. The
generated output can then be used to extract the target textual content.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Parser module</title>
<p>The Parser module receives as input the metadata of the given article, the XML or JSON
full-text articles obtained via API requests, and the output of the OCR module. Its objective is to provide,
for each different full-text file format, a unique JSON file following a well-defined schema, including
both the metadata and the textual content of the given article, as in Figure 3. The so-obtained parsed documents,
sharing a standard and common format, serve as input to build an end-to-end RAG architecture, as
shown in Section 5.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. GUI</title>
        <p>In addition to the proposed retrieval pipeline, a user interface was developed using the CustomTkinter
[21] Python library, based on Tkinter. This library allows for the creation of custom and intuitive
graphical interfaces thanks to its large number of widgets. The interface allows users to run queries in a
simple way. The left section of the interface allows users to enter keywords using the Search Keyword
bar, select the starting year for the article search, and finally select the publishing houses included in
PARSAL. Once all these parameters have been set, the search can be started by clicking on the Start
Research button, and the results will be displayed in the central box of the app. The query results can
be downloaded or filtered using the check boxes next to each article. For each article, the title, DOI,
associated keywords, and abstract are returned, allowing the user to evaluate them manually. A
snapshot of the GUI is shown in Figure 4.</p>
<preformat>
{
  'doi': '10.1016/j.rico.2024.100489',
  'title': 'Satellite imagery, big data, IoT and deep learning techniques
            for wheat yield prediction in Morocco',
  'authors': ['Abdelouafi Boukhris', 'Antari Jilali', 'Abderrahmane Sadiq'],
  'keywords': ['Satellite imagery', 'IoT', 'ArcGis',
               'Deep learning', 'RMSE', 'NoSQL Database'],
  'sections': {
    'Introduction': 'The efficient oversight of agricultural operations,
                     ensuring [...]',
    'Crop wheat yield prediction in morocco: a proposed system': 'Yield
                     prediction plays [...]'
  },
  'abstract': 'In the domain of efficient management of resources and
               ensuring nutritional consistency, accuracy [...]',
  'editor': 'Elsevier'
}
        </preformat>
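<p>A light-weight conformance check against the schema of Figure 3 can be sketched as follows; this is an illustrative validator, the actual parser enforces the schema during conversion:</p>
<preformat>
```python
REQUIRED_FIELDS = {"doi", "title", "authors", "keywords", "sections", "abstract", "editor"}

def validate_parsed(doc):
    """Verify that a parsed article dict carries every field of the standard schema."""
    missing = REQUIRED_FIELDS - doc.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(doc["sections"], dict):
        raise TypeError("'sections' must map section titles to their text")
    return True

doc = {
    "doi": "10.1016/j.rico.2024.100489",
    "title": "Satellite imagery, big data, IoT and deep learning techniques ...",
    "authors": ["Abdelouafi Boukhris"],
    "keywords": ["Satellite imagery"],
    "sections": {"Introduction": "..."},
    "abstract": "...",
    "editor": "Elsevier",
}
print(validate_parsed(doc))  # True
```
</preformat>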
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Computer Aided Drug Discovery study case</title>
<p>A case study in the field of computer-aided drug discovery was conducted to test the efficiency of the
proposed pipeline. This topic was selected because it can be validated by authors who are experts
in the field, in terms of the validity of the articles retrieved, their adherence to the reference topic, and the
subsequent construction and querying of the KG.</p>
      <p>In particular, five keywords were carefully chosen as meaningful for the target field
of interest in the selected study case, and have been considered both in their expanded version and
in their condensed form, as they are used in scientific publications. Therefore, Drug Discovery, Bi
Accuracy Prediction, Graph Convolutional Network (GCN), Ligand Based Virtual Screening (LBVS) and
Structure Based Virtual Screening have been used to retrieve full-text articles from ACL Anthology,
arXiv, Elsevier, Springer, and Wiley, as reported in Table 2.
Despite the wide availability of articles that are freely accessible or obtainable thanks to agreements
between our University and publishers, only a relatively small quantity was effectively available for
full-text download in an automatic manner.</p>
<p>Once the articles had been downloaded, we ensured that any duplicates were deleted by DOI
checking, as the same article can be retrieved twice, e.g. when both “GCN” and
“Drug Discovery” appear among its keywords. In this phase, each API was queried with a single keyword, so the aforementioned
duplications can occur. Although some publishers allow complex queries involving logical OR
or AND operators, we decided to use a single query for each keyword and its abbreviated form, to simulate the
same retrieval conditions. After this retrieval phase, articles in XML and JSON format are injected into
the parser, while the retrieved PDFs are first analyzed by the OCR module and then parsed into the
same unique format (Section 4.2).</p>
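<p>The DOI-based deduplication step amounts to keeping the first occurrence of each DOI (a minimal sketch with made-up DOIs):</p>
<preformat>
```python
def deduplicate_by_doi(articles):
    """Drop articles whose DOI was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for art in articles:
        doi = art["doi"].lower()   # DOIs are case-insensitive
        if doi not in seen:
            seen.add(doi)
            unique.append(art)
    return unique

retrieved = [
    {"doi": "10.1000/a1", "matched_keyword": "GCN"},
    {"doi": "10.1000/a1", "matched_keyword": "Drug Discovery"},  # duplicate hit
    {"doi": "10.1000/b2", "matched_keyword": "LBVS"},
]
print(len(deduplicate_by_doi(retrieved)))  # 2
```
</preformat>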
<p>The articles obtained in JSON format are then loaded into the proposed GUI app (Section 4.3) to
be browsed interactively. In addition, we explored an interaction modality based on Generative AI,
where the user’s query leverages a KG built from the collected articles. In this section, both the KG
building and inference processes are reported. The experiments, for the OCR and Parser modules as well as for
KG building and inference, were executed on a cluster node with 1 NVIDIA A100 64 GB GPU of the
Leonardo supercomputer8, via an ISCRA-C application.</p>
      <sec id="sec-5-1">
        <title>5.1. KG creation</title>
<p>The objective is to create a semantic Knowledge Graph (KG) in which meaningful relationships between
different papers can be highlighted. To this end, each parsed article is arranged into multiple documents, each
containing a section of the original article in textual form, together with the corresponding metadata, including
the DOI, authors, title of the article, reference section, and keywords, as reported in Figure 5.</p>
<preformat>
Document(
  metadata={
    'doi': '10.1016/j.rico.2024.100489',
    'authors': ['Abdelouafi Boukhris', 'Antari Jilali', 'Abderrahmane Sadiq'],
    'keywords': ['Satellite imagery', 'Iot', 'Arcgis', 'Deep learning',
                 'RMSE', 'NoSQL Database'],
    'title': 'Satellite imagery, big data, IoT and deep learning
              techniques for wheat yield prediction in Morocco',
    'section': 'abstract',
    'editor': 'Elsevier'
  },
  textual_document={
    'text': 'In the domain of efficient management of resources and
             ensuring nutritional consistency, accuracy [...]'
  }
)
        </preformat>
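<p>Arranging a parsed article into per-section documents, as in Figure 5, can be sketched as follows, with plain dicts standing in for LlamaIndex Document objects:</p>
<preformat>
```python
def article_to_documents(article):
    """Split a parsed article into one document per section (abstract included),
    each carrying the shared article-level metadata."""
    shared = {k: article[k] for k in ("doi", "authors", "keywords", "title", "editor")}
    docs = [{"metadata": {**shared, "section": "abstract"},
             "text": article["abstract"]}]
    for name, text in article["sections"].items():
        docs.append({"metadata": {**shared, "section": name}, "text": text})
    return docs

article = {
    "doi": "10.1016/j.rico.2024.100489",
    "authors": ["Abdelouafi Boukhris"],
    "keywords": ["Satellite imagery"],
    "title": "Satellite imagery, big data, IoT and deep learning techniques ...",
    "editor": "Elsevier",
    "abstract": "In the domain of efficient management of resources ...",
    "sections": {"Introduction": "The efficient oversight ..."},
}
docs = article_to_documents(article)
print([d["metadata"]["section"] for d in docs])  # ['abstract', 'Introduction']
```
</preformat>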
        <sec id="sec-5-1-1">
          <title>8https://leonardo-supercomputer.cineca.eu/it/home-it/</title>
<p>From the collected and parsed articles, a KG was automatically built using the LlamaIndex framework9.
Starting from the collected articles, the KG is created by querying an external LLM to extract the semantic
triplets found in the provided textual content and arrange them in a graph-like structure, thus enabling
inter-document connections.</p>
<p>To this aim, different decoder-only generative models [22] have been used to create and query the
built KG. In particular, we chose among instruction-tuned LLMs with 4B parameters, namely Phi 4 mini
[23] and Gemma 3 4B [24], and with 8B parameters, namely Llama 3.1 8B [25] and Qwen 3 8B [26]. It is worth
noticing that the choice of the generative model is crucial, since different models generate different KGs,
despite having been injected with the same documents. We are also aware that more powerful models
could have been considered, such as GPT-4o [27] or Gemini [28], but their usage requires paid APIs, and
our priority was open-source models and reproducibility with relatively small computational resources.</p>
<p>To create the KG, documents have been divided into chunks of 4096 tokens, from each of which a
maximum of 5 triplets is extracted, and all selected LLMs were used with the generative parameters
suggested by their authors. A prompt-based generative approach was used to extract the relevant semantic
triplets from the given documents. This approach is efficient, since it is fully automated, and the
obtained KGs are completely data-driven and do not include any external knowledge injection, e.g. regarding
the typology of relations to be extracted. On the other hand, less significant triplets can be extracted
and included in the KG alongside more meaningful ones.</p>
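<p>The prompt-based triplet extraction can be illustrated with a template and a parser for the model's reply. Both the prompt wording and the reply format below are illustrative assumptions, not the exact LlamaIndex prompt:</p>
<preformat>
```python
import re

PROMPT_TEMPLATE = (
    "Extract up to {max_triplets} knowledge triplets from the text below.\n"
    "Reply one per line as (subject; relation; object).\n\nText: {chunk}"
)

def parse_triplets(reply, max_triplets=5):
    """Parse '(subject; relation; object)' lines from an LLM reply."""
    triplets = []
    for match in re.finditer(r"\(([^;)]+);([^;)]+);([^;)]+)\)", reply):
        triplets.append(tuple(part.strip() for part in match.groups()))
        if len(triplets) == max_triplets:
            break
    return triplets

reply = "(GCN; is used for; drug discovery)\n(LBVS; relies on; molecular descriptors)"
print(parse_triplets(reply))
# [('GCN', 'is used for', 'drug discovery'), ('LBVS', 'relies on', 'molecular descriptors')]
```
</preformat>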
<p>In Figure 6, a simplified graphical version of the built KG is reported, obtained by selecting the
“Drug-target interaction prediction” node and including its neighbors up to a 2-hop distance10.</p>
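<p>Such a k-hop neighborhood can be obtained with a breadth-first walk over the triplet graph; the adjacency list below is a toy example, not the actual KG:</p>
<preformat>
```python
from collections import deque

def k_hop_neighbors(graph, start, k=2):
    """Breadth-first search: nodes reachable from `start` within k hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for neigh in graph.get(node, ()):
            if neigh not in seen:
                seen.add(neigh)
                frontier.append((neigh, depth + 1))
    return seen

graph = {
    "Drug-target interaction prediction": ["GCN", "virtual screening"],
    "GCN": ["message passing"],
    "virtual screening": ["molecular descriptors"],
}
print(sorted(k_hop_neighbors(graph, "Drug-target interaction prediction", k=2)))
```
</preformat>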
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. KG inference</title>
<p>Once the KG was built, we selected five meaningful questions, reported in Table 3. The answers to these
questions can be obtained using multiple articles, thus stressing the retrieval and generative capabilities
of the KG coupled with the considered LLMs.</p>
<p>More in detail, we adopted the LlamaIndex query engine, which generates the desired answer from
the KG in multiple steps, involving each of the considered LLMs as the generative model. First, the input
query is analyzed to extract the meaningful words in the query itself (keywords11). The obtained results
are then used to retrieve the most meaningful triplets and documents contained in the graph, and,
in a generative manner, the retrieved content is used as context for the LLM to properly answer
the user’s question.</p>
        <sec id="sec-5-2-1">
          <title>9https://github.com/jerryjliu/llama_index 10The full navigable graph can be downloaded from the GitHub repository 11in this case, keyword does not refer to a keyword of a scientific article</title>
          <p>Table 3 reports the questions used to query the KG:
(Q1) What is drug discovery? What AI tools can be used in this field?
(Q2) How can a GCN be used in drug discovery applications?
(Q3) Which molecular descriptors are widely used in virtual screening approaches?
(Q4) Which neural networks are most widely used in ligand-based virtual screening approaches?
(Q5) Which explainability methods are used in drug discovery approaches?</p>
          <p>If the retrieved context exceeds the maximum number of tokens of the model
window, multiple queries are issued to the LLM with the new context and the answer generated so far,
until the last retrieved context is reached and the final answer is generated. The KGs created with the selected
LLMs have been queried (Table 3), the retrieved and generated content has been evaluated, and the
computational statistics, as for the inference time and the GPU resources required, are reported in Table 4.</p>
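<p>The iterative refinement over an oversized retrieved context can be sketched with a stub in place of the LLM call; the chunking and prompt wording are illustrative, not LlamaIndex's internal prompts:</p>
<preformat>
```python
def refine_answer(llm, question, context_chunks):
    """Feed context chunks one at a time, letting the LLM refine its answer."""
    answer = ""
    for chunk in context_chunks:
        prompt = (
            f"Question: {question}\n"
            f"New context: {chunk}\n"
            f"Existing answer: {answer or '(none yet)'}\n"
            "Refine the existing answer using the new context."
        )
        answer = llm(prompt)
    return answer

# Stub LLM that just echoes the context it was given, for illustration only.
def stub_llm(prompt):
    context = prompt.split("New context: ")[1].split("\n")[0]
    return f"answer covering [{context}]"

chunks = ["GCNs encode molecular graphs", "descriptors feed the screening model"]
print(refine_answer(stub_llm, "How can a GCN be used?", chunks))
# answer covering [descriptors feed the screening model]
```
</preformat>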
          <p>For each question, we evaluated the percentage of meaningful retrieved content: the 4B models cannot
perform well in the retrieval phase and fail, since no context is retrieved. In contrast, the 8B models
achieve comparable performance in the retrieval phase, with 70% of relevant context retrieved. As the
models have not been fine-tuned for either KG creation or inference, we argue that the different native
instruction fine-tuning processes on which the models were trained played a crucial role, along
with the models' parameter size. More in detail, the proposed application employs not only the purely
verbose generative capabilities of the selected models, but also their abilities in traditional NLP tasks such
as Relation Extraction during KG creation and in the retrieval phase itself.</p>
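<p>As an illustration of the Relation Extraction step involved in KG creation, the sketch below parses LLM output lines into (subject, relation, object) triplets; the <code>(s; r; o)</code> line format is an assumed output convention, not the prompt format actually used.</p>

```python
import re

def parse_triplets(llm_output: str) -> list[tuple[str, str, str]]:
    """Parse lines like '(GCN; used_in; drug discovery)' into 3-tuples,
    skipping malformed lines instead of failing."""
    triplets = []
    for line in llm_output.splitlines():
        m = re.match(r"\s*\(([^;]+);([^;]+);([^)]+)\)\s*$", line)
        if m:
            triplets.append(tuple(part.strip() for part in m.groups()))
    return triplets

out = "(GCN; used_in; drug discovery)\nnot a triplet\n(MACCS keys; type_of; 2D descriptor)"
print(parse_triplets(out))
# → [('GCN', 'used_in', 'drug discovery'), ('MACCS keys', 'type_of', '2D descriptor')]
```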
          <p>As for performance, while near-real-time applications can be implemented with the smaller models
on a 16 GB GPU on average, the 8B models required more than 32 GB of GPU memory and more than one
minute per answer on average. Given the results obtained, the second group of larger models represents
the minimum satisfactory level in terms of both performance and computational requirements.</p>
          <p>Table 5 reports the generated response to Q3 and the DOIs of the retrieved articles12. Regarding
the validity of the generated content, all responses were evaluated by domain experts, who found
them correct. Answers were also considered correct when a model replied
that no relevant context had been provided to properly answer the given question, coherently with the
human judgment in Table 4.</p>
          <p>Overall, both Llama 3.1 and Qwen 3, in their instruct 8B-parameter versions, are valuable alternatives
for the proposed KG- and RAG-based application.</p>
          <p>Improvements on the presented approach can be made in the retrieval phase, involving a Graph DB
and SPARQL queries. In addition, this can be beneficial in terms of efficiency in the case of a larger
retrieval corpus. In this case, more time would be required by the retrieval pipeline, both for the
download and parsing phases with the OCR and Parser modules (as in Section 4) and for the KG
generation phase.</p>
          <p>12: The full generated answers can be downloaded from the GitHub repository.</p>
          <p>Which molecular descriptors are widely used in virtual screening approaches?</p>
          <p>Qwen 3 8B: 2D descriptors such as MACCS keys, PubChem fingerprints, and molecular formulae are
widely used in virtual screening approaches. The answer is based on the knowledge that molecular
descriptors are used in virtual screening, and the knowledge that 2D descriptors include MACCS keys,
PubChem fingerprints, and molecular formulae.</p>
          <p>Now, answer the following query using the knowledge from the provided context
information and not prior knowledge.</p>
          <p>Sources: arXiv:2404.04559, arXiv:2411.12748</p>
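<p>The Graph DB and SPARQL direction proposed above amounts to pattern matching over stored triplets; the following pure-Python sketch mimics a single SPARQL basic graph pattern on invented example triples.</p>

```python
# Tiny in-memory triple store illustrating what a Graph DB + SPARQL
# retrieval step would do; the triples themselves are invented examples.
TRIPLES = [
    ("MACCS keys", "type_of", "2D descriptor"),
    ("PubChem fingerprints", "type_of", "2D descriptor"),
    ("2D descriptor", "used_in", "virtual screening"),
]

def match(pattern, triples):
    """Match a (s, p, o) pattern where None is a wildcard,
    like a single SPARQL basic graph pattern."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Analogue of: SELECT ?s WHERE { ?s ex:type_of "2D descriptor" }
print([s for s, _, _ in match((None, "type_of", "2D descriptor"), TRIPLES)])
# → ['MACCS keys', 'PubChem fingerprints']
```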
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>Bibliographic research of the scientific literature is a crucial step in any new research activity. The
downloading and organizing of articles is a time-consuming process that requires automated systems
to keep up with the number of articles published every day in several research domains. In this work,
we present PARSAL, an automatic retrieval pipeline aimed at supporting the scientific community in
retrieving and structuring the scientific literature. PARSAL relies on a set of dedicated APIs to download
either native PDF or JSON / HTML / XML articles from the repositories of the main open-access
scientific publishers. PDF documents are converted into text through olmOCR, and all articles are
transformed into a suitable JSON format, thus enabling the creation of a MongoDB instance that can
be browsed through an easy-to-use graphical interface. One of the main objectives of the paper is to
demonstrate that such an information organization can be a valuable support for advanced retrieval
systems based on generative AI.</p>
      <p>To this aim, we validated PARSAL in the field of drug discovery: 1,840 articles were downloaded and
formatted from various publishers, based on 7 keywords, in order to build a Knowledge Graph to be
used for the RAG approach with different LLMs. Domain experts formulated five questions
that were posed to the KG; they assessed the responses of the system and found them satisfactory, both
as regards the retrieved references and the actual content, when Llama 3.1 and Qwen 3 8B were
used in their native instruct versions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is supported by the CUP project B73C22000810001, project code ECS_00000022
“SAMOTHRACE” (Sicilian MicronanoTech Research And Innovation Center), and the CUP project J73C24000070007,
“CAESAR” (Cognitive evolution in Ai: Explainable and Self-Aware Robots through multimodal data
processing). The works presented were partially developed on the Leonardo supercomputer with the
support of the CINECA Italian Super Computing Resource Allocation class C project IscrC_DOCVLM2
(HP10C97VNN).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepL and Quillbot for translation, grammar,
and spelling checks. After using these tools, the authors reviewed and edited the content as needed and
take full responsibility for the publication's content.</p>
      <p>[11] S. E. Robertson, H. Zaragoza, M. Taylor, Simple BM25 extension to multiple weighted fields, in:
Proceedings of the Thirteenth ACM International Conference on Information and Knowledge
Management, CIKM '04, ACM, New York, NY, USA, 2004, pp. 42-49. URL: https://doi.org/10.1145/1031171.1031181. doi:10.1145/1031171.1031181.
[12] Wiley Online Library, Wiley text and data mining API documentation, 2024. URL: https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining, accessed: 2025-01-26.
[13] J. Poznanski, J. Borchardt, J. Dunkelberger, R. Huff, D. Lin, A. Rangapur, C. Wilhelm, K. Lo,
L. Soldaini, olmOCR: Unlocking trillions of tokens in PDFs with vision language models, arXiv
preprint arXiv:2502.18443 (2025). URL: https://arxiv.org/abs/2502.18443. Allen Institute for AI.
[14] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du,
X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, J. Lin, Qwen2-VL: Enhancing vision-language model's
perception of the world at any resolution, arXiv preprint arXiv:2409.12191 (2024).
[15] J. Poznanski, A. Rangapur, J. Borchardt, et al., olmOCR-mix-0225 Dataset, https://huggingface.co/datasets/allenai/olmocr-mix-0225, 2024.
[16] R. Smith, An overview of the Tesseract OCR engine, in: Proceedings of the Ninth International
Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, IEEE, Curitiba, Brazil,
2007, pp. 629-633. doi:10.1109/ICDAR.2007.4376991.
[17] JaidedAI, EasyOCR: Ready-to-use OCR with 80+ supported languages, 2020. URL: https://github.com/JaidedAI/EasyOCR, Apache License 2.0.
[18] PaddleOCR Authors, PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle, https://github.com/PaddlePaddle/PaddleOCR, 2020.
[19] Mindee, docTR: Document text recognition, 2021. URL: https://github.com/mindee/doctr, Apache License 2.0.
[20] Z. Shen, R. Zhang, M. Dell, B. C. G. Lee, J. Carlson, W. Li, LayoutParser: A unified toolkit for deep
learning based document image analysis, arXiv preprint arXiv:2103.15348 (2021).
[21] T. Schimansky, CustomTkinter: A modern and customizable Python UI-library based on Tkinter,
2022. URL: https://github.com/TomSchimansky/CustomTkinter, version 5.2.2.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
Attention is All you Need, in: Advances in Neural Information Processing Systems, 2017.
[23] A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, et al., Phi-4-Mini Technical Report:
Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs, arXiv preprint
arXiv:2503.01743 (2025).
[24] Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, et al., Gemma 3 Technical
Report, arXiv preprint arXiv:2503.19786 (2025).
[25] Llama Team, The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783 (2024).
[26] Qwen Team, Qwen3 Technical Report, arXiv preprint arXiv:2505.09388 (2025).
[27] OpenAI, GPT-4o System Card, arXiv preprint arXiv:2410.21276 (2024).
[28] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth,
K. Millican, et al., Gemini: A Family of Highly Capable Multimodal Models, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Z. Dai, S. Xu, X. Wu, R. Hu, H. Li, H. He, J. Hu, X. Liao, Knowledge mapping of multicriteria decision analysis in healthcare: A bibliometric analysis, Frontiers in Public Health 10 (2022). URL: https://www.frontiersin.org/journals/public-health/articles/10.3389/fpubh.2022.895552/full. doi:10.3389/fpubh.2022.895552.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Advances in Neural Information Processing Systems (2020).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] G. Hendricks, D. Tkaczyk, J. Lin, P. Feeney, Crossref: The sustainable source of community-owned scholarly metadata, Quantitative Science Studies 1 (2020) 414-427. URL: https://doi.org/10.1162/qss_a_00022. doi:10.1162/qss_a_00022.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] E. Orel, I. Ciglenecki, A. Thiabaud, A. Temerev, A. Calmy, O. Keiser, A. Merzouki, An Automated Literature Review Tool (LiteRev) for Streamlining and Accelerating Research Using Natural Language Processing and Machine Learning: Descriptive Performance Evaluation Study, Journal of Medical Internet Research 25 (2023) e39736. doi:10.2196/39736.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] J. Lála, O. O'Donoghue, A. Shtedritski, S. Cox, S. G. Rodriques, A. D. White, PaperQA: Retrieval-Augmented Generative Agent for Scientific Research, arXiv preprint arXiv:2312.07559 (2023). doi:10.48550/arXiv.2312.07559.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] B. Han, T. Susnjak, A. Mathrani, Automating Systematic Literature Reviews with Retrieval-Augmented Generation: A Comprehensive Overview, Applied Sciences 14 (2024) 9103. doi:10.3390/app14199103.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] P. Ginsparg, It was twenty years ago today..., arXiv preprint arXiv:1108.2700 (2011).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] V. Larivière, C. R. Sugimoto, B. Macaluso, S. Milojević, B. Cronin, M. Thelwall, arXiv e-prints and the journal of record: An analysis of roles and relationships, Journal of the Association for Information Science and Technology 65 (2014) 1157-1169. doi:10.1002/asi.23044.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] C. Lagoze, H. Van de Sompel, M. Nelson, S. Warner, The Open Archives Initiative Protocol for Metadata Harvesting - Version 2.0, Technical Report, Open Archives Initiative, 2002. URL: http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] National Information Standards Organization, JATS: Journal Article Tag Suite, Version 1.3, ANSI/NISO Standard Z39.96-2021, NISO, 2021. URL: https://www.niso.org/standards-committees/jats.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>