<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai Gheorghe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cătălina Chinie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru Roman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bucharest University of Economic Studies</institution>
          ,
          <addr-line>Piața Romană 6, Bucharest</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SINTEF AS</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The increasing volume of scientific literature and the abundance of publicly accessible data present a substantial hurdle for researchers aiming to stay informed and effectively derive valuable insights. In this paper we discuss the use of LLMs in the context of extracting information from scientific literature and introduce an AI-driven Research Assistant that uses custom Retrieval Augmented Generation (RAG) as a Service and other techniques to streamline processes such as literature review, information extraction, and knowledge discovery.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>Automated literature review</kwd>
        <kwd>Information extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A variety of techniques are being explored to overcome these limitations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These include:
      </p>
      <p>Reranking: Employing a two-step retrieval process, first retrieving a larger set of
candidates using efficient methods like cosine similarity, then reranking them using more
sophisticated techniques like cross-encoders or even LLMs themselves, to better capture
relevance and address multi-hop reasoning.</p>
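      <p>The two-step scheme can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the corpus vectors and the cross_score function stand in for a real embedding model and cross-encoder.</p>

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def two_step_retrieve(query_vec, corpus, cross_score, k=3, final=2):
    # Step 1: cheap dense retrieval -- rank every document by cosine similarity
    # and keep only the top-k candidates.
    candidates = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]
    # Step 2: expensive reranking -- apply a cross-encoder-style scorer
    # to the small candidate set only, then return the best `final` hits.
    return sorted(candidates, key=cross_score, reverse=True)[:final]
```

The point of the split is cost: the cheap similarity pass touches the whole corpus, while the expensive scorer only ever sees k documents.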
      <p>
        Hybrid RAG: Combining semantic similarity-based retrieval with knowledge graphs to
incorporate explicit relationships and facilitate more complex reasoning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Large Context LLMs: Models such as Claude, with its 200k-token context, or Gemini, with its
1M+ token context, can significantly enhance RAG systems by incorporating more extensive
context directly into the model. This reduces the need for explicit retrieval of external
information in many cases, allowing the LLM to draw upon a broader knowledge base to
understand and reason over larger chunks of information [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Hierarchical Embeddings: Leveraging embeddings of the original text alongside various
levels of summaries, such as in the RAPTOR model, to enhance retrieval accuracy and
efficiency [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Multi-Hop Query Answering: Decomposing complex queries into simpler sub-queries
and employing techniques like Chain-of-Thought prompting to guide LLMs through
multi-step reasoning [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ].
      </p>
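      <p>Query decomposition can be illustrated with a deliberately naive sketch. The hard-coded decompose rule and the lookup table below are hypothetical stand-ins for what a real system would obtain from an LLM-generated plan and a retrieval step.</p>

```python
def decompose(query):
    # Naive decomposition: a comparison question becomes one sub-query
    # per entity. A real system would ask the LLM to produce this plan.
    if query.startswith("compare "):
        entities = query[len("compare "):].split(" and ")
        return [f"population of {e}" for e in entities]
    return [query]

def answer_multi_hop(query, lookup):
    # Answer each sub-query independently (one "hop" each),
    # then merge the partial answers into a single result.
    return {sub: lookup[sub] for sub in decompose(query)}
```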
      <p>
        Autonomous Agents: Utilizing AI agents to navigate diverse information sources and
construct more intricate prompts for the LLM, incorporating logic and multi-step
reasoning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Self-RAG: Where the LLM itself participates in the retrieval process, potentially leading
to more adaptive and context-aware retrieval [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>These advancements highlight the active research and development in the RAG paradigm,
aiming to address its limitations and enable LLMs to tackle increasingly complex and nuanced
information needs.</p>
      <p>Despite these advancements, constructing a RAG system specifically for scientific research
remains a difficult task. Challenges arise in handling diverse content types commonly found in
research papers, such as tables, images, and formulas, which often necessitate conversion into
natural language text for effective retrieval. Moreover, the research process often involves
snowballing, where the corpus of relevant references expands iteratively from an initial set of
studies.</p>
      <p>
        A review of existing AI assistants and RAG systems tailored towards researchers reveals
the lack of a universally applicable solution, although several platforms demonstrate advanced
techniques and some even appear customized for scientific research. Existing solutions in this
area can be categorized into three distinct classes:
• AI Document Assistants: Constituting the most prevalent category, these solutions
range from freely available to premium licensed offerings. While they leverage
cutting-edge LLMs and enable users to upload documents in various formats, answer questions
based on those documents, and perform summarization, they often exhibit limitations
specifically concerning academic research support. These constraints include operating
on a limited number of documents (typically restricted by the LLM's context window),
lacking specialization in scientific papers, and generally not performing multi-hop
reasoning across multiple documents.
• General-purpose RAG as a Service: Augmenting document assistants with retrieval
capabilities enables access to a substantially larger corpus of documents. Solutions
within this category encompass both commercial and freely available options. The
following are noteworthy examples:
• RAGflow: An open-source RAG engine with a comprehensive feature set and a
user-friendly, modern interface [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It offers partial customization in terms of
embedding models and LLMs, alongside fine-grained control over chunking
strategies. It also provides tools for constructing custom AI agents. However, as
a general-purpose product, it relies on paragraph-length chunking rather than
semantic or logical separation, and it lacks automated corpus construction and
multimodal capabilities. Despite these limitations, the product shows promise due to
its ongoing development.
• Vectara: A company specializing in RAG solutions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], with a strong reputation
in the field. However, it does not offer an out-of-the-box, readily available
Academic RAG as a Service product, instead providing customized software on demand.
• Other solutions: Several other solutions exist, such as Humata AI [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Diligist
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Weaviate Verba [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Anything LLM [16], and RAGify [17]. However, these
options lack certain crucial features, including multi-hop capabilities, custom
structured information extraction, re-ranking, traceability, and agentic behavior.
• Academic Research Specialized Assistants: This category comprises solutions
specifically designed for academic research. Notable examples include:
• Sakana AI Scientist: Primarily focused on the autonomous generation of complete
academic papers [18], its Q&amp;A capabilities are not the central feature. Although
the demonstrated results are impressive, concerns persist regarding factuality
benchmarks and ethical considerations. Similarly, Insilico's Dora
[19] also generates full papers without chat or Q&amp;A functionalities, and appears
to have fewer features than Sakana. Unriddle.ai [20], another solution in
this category, generates full papers and even offers LaTeX rendering, but lacks
traceability, multi-hop capabilities, LLM or embedding model agnosticism, and
structured information extraction. Notably, these solutions do not operate as
traditional RAG as a Service platforms.
• Clarivate AI Academia: Recently announced by Clarivate [21]; detailed
information about its features and performance remains limited. However, given
Clarivate's established track record, the product has strong potential to become
a noteworthy contender in this space.
      </p>
      <p>To overcome the above-mentioned limitations, we initiated the development of a RAG as a Service
Research Assistant, whose features we briefly introduce in the following section.</p>
    </sec>
    <sec id="sec-2">
      <title>2. AI Research Assistant</title>
      <sec id="sec-2-1">
        <title>2.1. Dynamic corpus construction</title>
        <p>Researchers can upload their own PDF papers or initiate automated downloads for specific
queries through the implementation of ArXiv and Semantic Scholar APIs, thus having access to
millions of papers. GROBID processes papers, resulting in structured XML representations with
clearly defined sections, figures, tables, and references [22]. The automated download function
can also recursively expand the corpus by extracting references from the initial document set.</p>
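        <p>The recursive corpus expansion amounts to a breadth-first traversal of the citation graph. The sketch below is a simplified illustration under that assumption; the get_references callback stands in for the GROBID/Semantic Scholar reference-extraction step, and max_papers is a hypothetical corpus-size budget.</p>

```python
from collections import deque

def snowball(seed_ids, get_references, max_papers=100):
    # Breadth-first "snowballing": start from the seed papers and
    # repeatedly pull in the references of papers already collected,
    # stopping once the corpus-size budget is reached.
    seen, queue = set(seed_ids), deque(seed_ids)
    while queue:
        paper = queue.popleft()
        for ref in get_references(paper):
            if len(seen) >= max_papers:
                break
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen
```

The seen set guarantees each paper is fetched at most once, even when the citation graph contains cycles.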
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Retrieval and question-answering</title>
        <p>Full plain-text sections from papers are indexed in ChromaDB [23] using cutting-edge
BAAI/bge-m3 [24] dense embeddings. Deviating from most Q&amp;A RAG systems, we employ large paragraph
chunks (often entire chapters/sections) to maintain context. Oversized paragraphs are divided
into subsections while preserving sentence integrity. Question answering employs cosine
similarity retrieval, with results re-ranked using BAAI/bge-reranker-v2-m3 [25]. In cases where
no relevant documents are retrieved, the system transparently informs the researcher that the
answer is not grounded in the corpus. Each answer is accompanied by source paper sections,
promoting transparency and facilitating further exploration.</p>
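        <p>The division of oversized paragraphs while preserving sentence integrity can be sketched as follows. This is a simplified stand-in for the actual chunker: it packs whole sentences until a character budget is reached, using a naive period-based sentence split.</p>

```python
def chunk_section(text, max_chars=200):
    # Split an oversized section into chunks without ever cutting a
    # sentence in half: whole sentences are accumulated until adding
    # the next one would exceed the character budget.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than max_chars still becomes its own chunk, trading strict size limits for intact sentences.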
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Custom information extraction</title>
        <p>Our AI Research Assistant facilitates the extraction of custom-structured information from
scientific papers by generating valid JSON schemas from natural language queries. The standard
schema extracts a comprehensive set of data including definitions, indicators, hypotheses, key
findings, topics, and summaries from each paper. These summaries can be indexed within the
vector database as well, enabling the system to also respond to high-level conceptual queries, in
contrast to the specific questions grounded in isolated paper sections that are addressed by the
previously mentioned Q&amp;A RAG features. Cost analysis indicates that leveraging a Model as a
Service (for instance, Claude Haiku [26]) for information extraction incurs an estimated cost of
1 USD per 250 papers.</p>
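        <p>Enforcing a schema on extraction output can be illustrated with a minimal structural check. This is a hedged sketch, not the system's validator: the field names below are hypothetical, and a production system would validate the full JSON schema rather than just field presence and type.</p>

```python
def validate_extraction(record, schema):
    # Minimal structural check: every field named in the schema must be
    # present in the record and of the declared type. This mirrors how
    # enforcing a JSON schema keeps LLM extraction output machine-readable.
    for field, expected_type in schema.items():
        if field not in record or not isinstance(record[field], expected_type):
            return False
    return True
```

Records failing the check can be sent back to the LLM for regeneration instead of silently polluting the index.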
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Architecture and scalability</title>
        <p>The system adopts a decoupled multi-server architecture for scalability, with LangChain [27]
partially managing orchestration. A GPU-intensive machine is required for vector embedding,
re-ranking, and local inference when utilizing local LLMs. The workflow exposes API endpoints
that can be consumed by a web application to manage user access.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. AI Agents</title>
        <p>The project incorporates agentic behavior, utilizing a zero-shot classifier to direct user
queries to the most suitable tool. These tools include classic RAG/Q&amp;A based on articles,
structured information extraction, and deterministic queries to SQL-like datasets (e.g., locally
hosted EUROSTAT data). LangChain's agentic implementation allows for chaining multiple tools
in a single user query. Therefore, an answer to a user query can be grounded in both the scientific
corpus and a relevant dataset.</p>
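        <p>The routing step can be sketched as below. The keyword-overlap scorer is a deliberately crude stand-in for the zero-shot classifier, and the tool names are hypothetical; the structure of "score each tool, dispatch to the best match" is what carries over.</p>

```python
def route_query(query, tools):
    # Stand-in for the zero-shot classifier: score each registered tool
    # by keyword overlap with the query and dispatch to the best match.
    words = set(query.lower().split())
    best = max(tools, key=lambda t: len(words.intersection(t["keywords"])))
    return best["name"]
```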
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Relevance to Rule-Based AI, Decisions, and Reasoning</title>
      <p>Our approach achieves deterministic and explainable results through:
• Deterministic LLM use (temperature set to 0)
• Grounding question-answering in the scientific literature corpus and indicating the paper
sources alongside the answer
• Enforcing JSON schemas for structured information extraction
• Verbose mode for the agentic workflow, enhancing explainability
• Integration of conventional programming tools for querying structured data sources</p>
      <p>This research assistant offers a significant contribution to AI-powered literature analysis,
providing researchers with a valuable tool for navigating the expansive landscape of scientific
knowledge.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Further directions</title>
      <p>We aim to integrate image and figure extraction from papers, leveraging multi-modal LLMs to
enrich the dataset. Open models such as PaliGemma [28] or Idefics [29] can do image-to-text
inference locally on reasonably accessible hardware.</p>
      <p>Related to multi-hop reasoning, we plan to employ techniques like query decomposition and
graph neural networks to address complex, multi-step queries.</p>
      <p>Additionally, we plan to develop more agentic tools capable of handling various datasets,
expanding the system's capabilities and adaptability.</p>
      <p>In the mid to long term, we intend to enhance our system by integrating knowledge graph
retrieval with the existing semantic similarity-based approach. This knowledge graph will be
constructed during the paper processing and parsing stage, extracting relevant entities like
topics, affiliations, and named entities as nodes. This hybrid approach aims to facilitate more
nuanced and complex query handling, enabling the AI Research Assistant to better understand
and leverage the intricate relationships within scientific literature.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This research is financed under the Romanian National Recovery and Resilience Plan, by the
Romanian Government, under contract number 268/29.11.2022, entitled “CAUSEFINDER
CAUSALITY IN THE ERA OF BIG DATA”.</p>
      <p>[16] https://github.com/Mintplex-Labs/anything-llm
[17] https://github.com/kanad13/RAGify
[18] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha, The AI Scientist: Towards Fully
Automated Open-Ended Scientific Discovery (2024). doi: 10.48550/arXiv.2408.06292
[19] https://insilico.com/science42/dora
[20] https://www.unriddle.ai
[21] G. Ben-Porat, Introducing the Clarivate Academic AI Platform, 2024. URL:
https://clarivate.com/blog/introducing-the-clarivate-academic-ai-platform
[22] GROBID [Computer software] (2008–2024). URL: https://github.com/kermitt2/grobid
[23] https://www.trychroma.com
[24] https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3
[25] https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_reranker
[26] https://www.anthropic.com/news/claude-3-haiku
[27] https://www.langchain.com
[28] https://huggingface.co/google/paligemma-3b-pt-896
[29] https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in: NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 9459-9474. doi: 10.48550/arXiv.2005.11401.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] X. Wang, Z. Wang, X. Gao, F. Zhang, Y. Wu, Z. Xu, et al., Searching for Best Practices in Retrieval-Augmented Generation (2024). doi: 10.48550/arXiv.2407.01219.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Sarmah, B. Hall, R. Rao, S. Patel, S. Pasquali, and D. Mehta, HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction (2024). doi: 10.48550/arXiv.2408.04948.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Z. Li, C. Li, M. Zhang, Q. Mei, and M. Bendersky, Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach (2024). doi: 10.48550/arXiv.2407.16833.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning, RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval (2024). doi: 10.48550/arXiv.2401.18059.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Z. Jiang, M. Sun, L. Liang, and Z. Zhang, Retrieve, Summarize, Plan: Advancing Multi-hop Question Answering with an Iterative Approach (2024). doi: 10.48550/arXiv.2407.13101.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Y. Tang and Y. Yang, MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries (2024). doi: 10.48550/arXiv.2401.15391.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] W. Xiong, X. L. Li, S. Iyer, J. Du, P. Lewis, W. Y. Wang, et al., Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval (2020). doi: 10.48550/arXiv.2009.12756.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Gamage, N. Mills, D. De Silva, M. Manic, H. Moraliyage, A. Jennings, and D. Alahakoon, Multi-Agent RAG Chatbot Architecture for Decision Support in Net-Zero Emission Energy Systems, in: 2024 IEEE International Conference on Industrial Technology (ICIT), Bristol, United Kingdom, 2024, pp. 1-6. doi: 10.1109/ICIT58233.2024.10540920.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (2023). doi: 10.48550/arXiv.2310.11511.</mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] https://ragflow.io</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] https://vectara.com</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] https://blog.invgate.com/humata-ai</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] https://www.diligist.io</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] https://github.com/weaviate/verba</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>