<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Retrieving Information Presented on Web Pages Using Large Language Models: A Case Study</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Asselborn</surname></persName>
							<email>thomas.asselborn@uni-hamburg.de</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Humanities-Centered Artificial Intelligence</orgName>
								<orgName type="institution">Universität Hamburg</orgName>
								<address>
									<addrLine>Warburgstraße 28</addrLine>
									<postCode>20354</postCode>
									<settlement>Hamburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Centre for the Study of Manuscript Cultures</orgName>
								<orgName type="institution">Universität Hamburg</orgName>
								<address>
									<addrLine>Warburgstraße 26</addrLine>
									<postCode>20354</postCode>
									<settlement>Hamburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Cluster of Excellence &apos;Understanding Written Artefacts&apos; (UWA)</orgName>
								<orgName type="institution">University of Hamburg</orgName>
								<address>
									<addrLine>Warburgstraße 26</addrLine>
									<postCode>20354</postCode>
									<settlement>Hamburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Karsten</forename><surname>Helmholz</surname></persName>
							<email>karsten.helmholz@uni-hamburg.de</email>
							<affiliation key="aff1">
								<orgName type="department">Centre for the Study of Manuscript Cultures</orgName>
								<orgName type="institution">Universität Hamburg</orgName>
								<address>
									<addrLine>Warburgstraße 26</addrLine>
									<postCode>20354</postCode>
									<settlement>Hamburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Cluster of Excellence &apos;Understanding Written Artefacts&apos; (UWA)</orgName>
								<orgName type="institution">University of Hamburg</orgName>
								<address>
									<addrLine>Warburgstraße 26</addrLine>
									<postCode>20354</postCode>
									<settlement>Hamburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ralf</forename><surname>Möller</surname></persName>
							<email>ralf.moeller@uni-hamburg.de</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Humanities-Centered Artificial Intelligence</orgName>
								<orgName type="institution">Universität Hamburg</orgName>
								<address>
									<addrLine>Warburgstraße 28</addrLine>
									<postCode>20354</postCode>
									<settlement>Hamburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">German Conference on Artificial Intelligence</orgName>
								<address>
									<addrLine>September 23</addrLine>
									<postCode>47th, 2024</postCode>
									<settlement>Würzburg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Retrieving Information Presented on Web Pages Using Large Language Models: A Case Study</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A8A5C4E5A10F81980FAE1221312BF4B5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:16+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Artefact Profiling</term>
					<term>Large Language Models (LLMs)</term>
					<term>Generative Pre-Trained Transformer (GPT)</term>
					<term>Retrieval Augmented Generation (RAG)</term>
					<term>Web Pages</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Developing web pages is a task that requires constant updating of new information. Additionally, multiple web pages with the same information must be developed but compiled differently if multiple user groups are targeted. Thus, we introduce a new approach that uses LLMs (Large Language Models), RAG (Retrieval Augmented Generation) and SCDs (Subjective Content Descriptions) to query the information on a web page and also provide sources to the original data. Since the LLM can rewrite the response based on the target user group, this reduces the need to make multiple web pages. Based on the example of the Artefact Profiling Guide, this approach reduces the need to provide user-specific web pages. Additionally, this method reduces the need for an expert in web programming and designing by offloading the task of presenting new data to the LLM. The prototype system has shown promising results so far. It provides the correct answers grounded by the source and written appropriately for the target group.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Web pages, like those for universities or corporations, are typically segmented into parts that offer information targeted to a specific user group. A university web page is usually divided into sections specifically curated to provide information for students, people interested in studying, employees, other researchers, the general public, etc. While this approach works fine, as shown in practice, there are specific scenarios where one may want to offer a single web page targeted to many user groups that may not even be known a priori. One example is that the information provided may interest a large group of people with different needs, be it pupils or scientific researchers. In this case, offering different web pages with duplicate content would generally be necessary, each curated for the respective audience. Alternatively, one has the option of offering a single web page written in such a way that it can address all target groups at once. However, web pages designed to be an all-in-one solution run the risk of making them too complicated for some yet too easy for others, thereby disappointing most, if not all, users.</p><p>LLMs utilising the transformer architecture <ref type="bibr" target="#b0">[1]</ref> like GPT (Generative Pre-trained Transformer) <ref type="bibr" target="#b1">[2]</ref> have several beneficial properties for making them an ideal candidate to be used on web pages with different and diverse user groups. On the one hand, they can be tailored to a specific downstream task, allowing them to acquire specific knowledge. 
Additionally, they can adapt their language to the specific user writing queries, e.g., by telling the user that the query should be explained to a 10-year-old child.</p><p>In this paper, we thus propose a different method (Section 3) of dealing with the problem of generating multiple distinct web pages by using LLMs to return the content a user may ask about and to provide citations to the original resource. It aims to investigate whether it is possible to replace web pages with a ChatGPT-like chatbot (Section 4) and which benefits and potential problems this approach may have (Sections 4 and 5). Before going into detail, Section 2 introduces the Artefact Profiling Guide.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Related Work</head><p>Working with LLMs and RAG is an actively researched topic. Gao et al. <ref type="bibr" target="#b2">[3]</ref> are comparing different methods of applying the general principles of RAG. Ramesh et al. <ref type="bibr" target="#b3">[4]</ref> have proposed context tuning for RAG to improve retrieval.</p><p>RAG is a method that several companies also use in practice to increase productivity. Examples include but are not limited to Telescope (a sales automation platform), Assembly (a human resources platform) and Causal (a financial planning tool). <ref type="bibr" target="#b4">[5]</ref> Also, bigger companies like IBM, Google, NVIDIA, and Microsoft are using RAG for various tasks. <ref type="bibr" target="#b5">[6]</ref> Using LLMs to generate web pages is a feature that Perplexity introduced a few months ago, which they called Perplexity Pages<ref type="foot" target="#foot_0">1</ref> . There, Perplexity queries and results can be displayed automatically on a web page. This helps to produce web pages without the need to know web development and design. However, their approach focuses on making web pages with information already present on the internet and compiling them into a single view. In contrast, our approach is focused on retrieving information that would typically be displayed on a web page with the help of LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The Artefact Profiling Guide</head><p>The Artefact Profiling Guide is an online guide written by members of various lab teams of CSMC (Centre for the Study of Manuscript Cultures). Its goal is to explain the scientific methods and analytical means used to analyse written artefacts to foster interdisciplinary work between the humanities, the natural sciences and computer science. It is meant as a broad overview with more detailed explanations provided as links to dedicated sources.</p><p>While the guide is primarily written for a target group of researchers and specialists, it is also meant to provide information to other potential user groups, such as libraries, museums, and other collections that do not necessarily have the expertise or the equipment to do analyses. Another potential user group may be private collectors who want more information about an artefact in their collection. Beyond those specific user groups, the guide is also meant to provide the general public with information about some aspects of the research carried out at CSMC. The general public is a blurry group that could range from young pupils to older people wanting to broaden their knowledge and anyone in between. This variety of target groups makes it difficult to provide the desired explanation level and find a language appropriate for all users.</p><p>Artefact profiling is an interdisciplinary field combining analytical means from various fields to understand written artefacts. 
It combines, among others not mentioned here, multiple so-called "omics" approaches:</p><p>• Genomics: This studies the genetic material in the (written) artefact.</p><p>• Proteomics: Analysis of proteins and peptides.</p><p>• Metabolomics: Metabolites and small molecules are examined.</p><p>• Metallomics: This deals with the metals and their distribution.</p><p>• Isotopolomics: Isotope ratios in the written artefacts are studied.</p><p>Artefact profiling aids in dating, locating and authenticating written artefacts. <ref type="bibr" target="#b6">[7]</ref> Artefact profiling is a dynamic field of research with new approaches emerging and new devices being developed to implement them. (see<ref type="foot" target="#foot_1">2</ref> ). Additionally, researchers are switching universities and positions frequently. Thus, the Artefact Profiling Guide needs to be updated regularly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>The data in the artefact profiling guide is constantly evolving and probably also not part of most LLMs pre-training. Thus, a way to ensure the LLM can understand new information must be used. Currently, there are two main methods of making sure that the LLM works better on a specific downstream task:</p><p>1. Fine-tuning: In fine-tuning, one takes a so-called pre-trained model, like Llama 3 <ref type="bibr" target="#b7">[8]</ref>, which is typically trained in a self-supervised<ref type="foot" target="#foot_2">3</ref> way. The model is typically fine-tuned in a supervised way using a smaller task-specific data set. With fine-tuning, the model's parameters are changed. While it is far less demanding regarding data set size and hardware resources than pre-training and is thus more feasible for smaller companies or research groups, fine-tuning still needs a GPU to be done efficiently. 2. RAG [9]: RAG is another method of making sure that the LLM works better on a specific downstream task. In contrast to fine-tuning, the internal parameters of the models are not changed; instead, context is provided using automatic prompt augmentation. Typically, the data the user wants to add to the LLM is encoded in a vector database using, e.g., BERT <ref type="bibr" target="#b9">[10]</ref> embeddings.</p><p>When the user sends a query, it is first encoded using the same encoding, the k best results from the vector database are returned and then automatically appended to the query, which is then sent to the LLM. This works because LLMs have a property called in-context learning <ref type="bibr" target="#b10">[11]</ref>, allowing the LLM to answer queries about a topic that was never seen in the pre-training data set.</p><p>We have decided to use RAG for our approach as this is less demanding in terms of hardware requirements, i.e., while fine-tuning requires a dedicated GPU, RAG also works with CPUs. 
Additionally, this is necessary because the data in the Artefact Profiling Guide is dynamically changing.  Everything on the right of the line is executed after every user query.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows a broad overview of the process. It is divided into two sections. On the left of the dotted red line is the pre-processing. This only gets executed when a change to the data source, in this case, the Artefact Profiling Guide, has happened. From there, the embeddings to be stored in the vector database and the SCDs <ref type="bibr" target="#b11">[12]</ref> are computed. SCDs contain additional data that is attached to locations in a text document. They can be descriptions, links, or labels and can be automatically generated. Thus, both pre-processing steps are completely automated without the need for human intervention.</p><p>The process on the right of the dotted, red line is executed at every user query. It can be split up into the following steps. It follows the same structure that was already introduced in <ref type="bibr" target="#b12">[13]</ref> as ChatHA.</p><p>1. The user enters a query and selects a user group. A few groups and example queries may be predefined (see Figure <ref type="figure" target="#fig_2">2a</ref>). 2. After the user has entered their query, it is embedded using the same algorithm that was used in the pre-processing, e.g., Sentence BERT <ref type="foot" target="#foot_3">4</ref> . This is then used to retrieve the n most similar entries from the vector database using, e.g., the cosine similarity or Euclidean distance <ref type="bibr" target="#b13">[14]</ref>. The results are then used to augment the original user query. Additionally, the query is also augmented by the user group that was selected prior. 3. Having the augmented query, this is then used to send the query to the LLM. 
This could be the OpenAI API using GPT4 <ref type="bibr" target="#b14">[15]</ref> or a local, open-source LLM like Llama 3 <ref type="bibr" target="#b7">[8]</ref>. 4. The output from the LLM, which we call raw output, is then annotated with links to the original data source. Using the Most Probably Suited SCD (MPS 2 CD) algorithm <ref type="bibr" target="#b15">[16]</ref>, the most suitable SCDs from the set of known SCDs (stored in the SCD matrix) are computed for the raw output. This then provides the citation with links to the source. Figure <ref type="figure" target="#fig_2">2b</ref> shows how this may look practically. 5. The final output with the citations is shown to the user.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment</head><p>A few tests have been performed to test this approach. First, performance for the pre-processing step was evaluated, i.e., the time it took to calculate the embeddings for the vector database and the SCDs. Everything was computed on a Macbook Pro with an M3 processor and 16GB of RAM, which is no special hardware. Thus, it is reasonable to assume that it should take roughly the same time. Since both processes are independent, they can be executed in parallel, which was also done for this test. We used ChromaDB <ref type="foot" target="#foot_4">5</ref> with the default Sentence BERT embeddings for the vector database. The results are as follows:</p><p>• Total number of words in the document: 67,438 • Total time to calculate and store embeddings: 10.5 min.</p><p>• Total time to calculate and store SCDs: 26.5 min.</p><p>All the numbers are in a range that makes it possible to be rerun regularly, e.g., once per week.</p><p>As a second experiment, a few test questions were asked to the system. This is then evaluated qualitatively. The key results are the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RAG does work in our test example</head><p>Mostly, the answers returned by the system are correct and factually backed by the data source. Also, this helped to answer questions that a standard LLM like Llama 3 could not answer. One such example is:</p><p>• Question: "What is ENCI?"  • Answer: "Based on the provided context, ENCI is mentioned as a type of equipment used at the CSMC (Center for the Study of Manuscript Cultures) for non-destructive 3D material characterisation. The equipment is described as being able to study surface topography and mechanical properties, and its mobility is considered to be stationary. It is also mentioned that ENCI can be used for in-situ measurements with portable equipment. It appears that ENCI is a specialised tool used for the analysis of materials and artifacts, particularly in the field of manuscript cultures and studies. "</p><p>Standard Llama 3 was unable to answer that question at all.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System is capable of automatically changing the "language" used to answer</head><p>The system aims to give answers in a language appropriate to the specific user group. The following example illustrates this:</p><p>• Question: "What is Raman spectroscopy?"</p><p>• Answer for 10-year old: "Raman spectroscopy is a special way to look at things without touching them or hurting them. It's like taking a picture of the things, but instead of using a camera, it uses special light to see what's inside the things. And it's very good for looking at old and special things, like art and treasures, because it doesn't hurt them at all!" • Answer for a private collector: "Dear fellow collector, Raman spectroscopy is a fantastic technique that allows us to examine and analyse our beloved artefacts without causing any harm to them. It's a non-invasive and nondestructive method that uses special light to "see" what's inside our objects, allowing us to learn more about their composition, materials, and even their age. It's been particularly useful in the field of Cultural Heritage, where preserving our treasures is of utmost importance. " • Answer for a researcher: "As a researcher, I can tell you that Raman spectroscopy is a non-invasive and non-destructive analytical technique that relies on the inelastic scattering of monochromatic light by atomic vibrations in the visible, near-infrared, and near-ultraviolet range. This technique has gained significant popularity in the field of Cultural Heritage over the past few decades, as it allows us to analyse samples without causing any harm. The advantages of Raman spectroscopy include its ability to be truly non-destructive. "</p><p>Citation system with SCDs has issues when the language used in answer differs significantly from original language</p><p>While the system can provide the appropriate language depending on the target group, this leads to problems with generating citations. 
Depending on how the LLM rewrites the answer, the MPS 2 CD algorithm may not always be able to find the correct citations. This needs to be investigated in the future in more detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Problems and Challenges</head><p>Working with LLMs still has some issues that may be faced also in the context of the Artefact Profiling Guide. The first observation was that the automatic prompt extension influenced the results significantly.</p><p>If the user is interested in information only grounded by the provided documents, it should be clearly stated in the prompt extension. Otherwise, the LLM also tends to give general information in the prompt. Depending on the task, this may or may not be wanted. While RAG is a method that aims at reducing hallucinations, the LLM may still give (partially) incorrect answers. One example is the following:</p><p>• Question: "What is the CSMC?" • Answer (using Llama 3 Instruct 8B): "I apologise for not providing context earlier! The CSMC likely refers to the Center for Systems Medicine and Computational Biology, which appears to be a research institution or organisation. However, without more information, it's difficult to determine the specific details about this center.</p><p>If you could provide more context or clarify what the CSMC is, I'd be happy to help with your original question!"</p><p>This incorrect answer was provided even though the documents provided had the necessary information to give a correct answer.</p><p>Opening an LLM to be used by the public also adds problems that cannot be solved by using RAG or even fine-tuning. They mainly arise because the system should be openly available without logging into it. A few of these issues are discussed here.</p><p>One issue that may occur is that opening the LLM without user registration (registration is probably not what the provider of a public web page wants) opens the possibility for Denial-of-Service (DoS) attacks, i.e., the input is flooded with queries that use up all of the hardware resources on the server making the web page unreachable for all legitimate users. 
Since LLMs have a higher demand for hardware than standard web pages, the limit will likely hit sooner. Thus, mitigation methods need to be thought of thoroughly.</p><p>Additionally, user protection must be incorporated into the system, especially for children. One of the texts in the data set has a medieval Arabic medicine book as an example containing medical exams and treatments. This also includes descriptions and illnesses that may not be suitable for children, yet it would still be technically correct for the LLM to return them as an answer. Some ways must be introduced to avoid giving inappropriate but technically correct answers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Outlook</head><p>In this paper, we introduced a new method of getting and providing information on a web page using LLM and RAG. We showed that it is feasible, but we also pointed out some potential issues that needed to be mitigated, such as DoS attacks or child safety issues. Additionally, changing the concrete prompt used for automatic prompt augmentation makes a difference in how the LLM will answer.</p><p>Future focus should be on how citations are handled when the answer uses significantly different words while still giving the correct answer, i.e., when the LLM answers in simple words while the original data source is written in a scientific language. Additionally, user tests with a more extensive user group may provide more insights into evaluating the introduced method. Currently, the method only works with text data, with future work investigating methods of incorporating images and videos that are integrated into the RAG pipeline and in the citation with the SCDs. It would also be beneficial to extend the system to understand the user automatically without telling it which user group the user belongs to.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Broad overview of the underlying process. Everything on the left of the dotted, red line is part of the pre-processing. This only needs to be done once per change of the content of the Artefact Profiling Guide. Everything on the right of the line is executed after every user query.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) Mockup of how the web page may look. User can select their group or a free input to target generation for their specific needs. Additionally, one can consider adding a few example queries as a starting point (not yet implemented). The pink field with the number inside indicates references the user can follow.(b) Example of how citations may look after clicking on the pink field, as seen in Figure 2a. In this example, information came from a PDF file but can potentially come from any type of text data.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Prototype implementation of the web page replacement. The top shows the question and answer, while the bottom shows what one citation may look like.</figDesc><graphic coords="5,83.28,414.47,428.73,183.18" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.perplexity.ai/de/hub/blog/perplexity-pages</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.csmc.uni-hamburg.de/publications/blog/2024-04-30-enci-inauguration.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.ibm.com/topics/self-supervised-learning</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://sbert.net/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://www.trychroma.com/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The research for this contribution was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy -EXC 2176 'Understanding Written Artefacts: Material, Interaction and Transmission in Manuscript Cultures', project no. 390893796. The research was conducted within the scope of the Centre for the Study of Manuscript Cultures (CSMC) at Universität Hamburg.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">U</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="https://openai.com/research/language-unsupervised" />
		<title level="m">Improving language understanding by generative pre-training</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<title level="m">Retrieval-augmented generation for large language models: A survey</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Context tuning for retrieval augmented generation</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bethi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vodianik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Chappidi</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2312.05708" />
	</analytic>
	<monogr>
		<title level="m">EACL Workshop</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">7 examples of retrieval-augmented generation (rag)</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gitlin</surname></persName>
		</author>
		<ptr target="https://www.merge.dev/blog/rag-examples" />
		<imprint>
			<date type="published" when="2024-06-30">2024. 2024-06-30</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">What is retrieval-augmented generation aka rag?</title>
		<ptr target="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/" />
		<imprint>
			<date type="published" when="2024-06-30">2024. 2024-06-30</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Artefact profiling: Panomics approaches for understanding the materiality of written artefacts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Creydt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fischer</surname></persName>
		</author>
		<idno type="DOI">10.3390/molecules28124872</idno>
	</analytic>
	<monogr>
		<title level="j">Molecules</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page">4872</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Introducing meta llama 3: The most capable openly available llm to date</title>
		<author>
			<orgName>Meta AI</orgName>
		</author>
		<ptr target="https://ai.meta.com/blog/meta-llama-3/" />
		<imprint>
			<date type="published" when="2024-05-23">2024. 2024-05-23</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Retrieval-augmented generation for knowledge-intensive nlp tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.11401</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>CoRR abs/1810.04805</idno>
		<ptr target="http://arxiv.org/abs/1810.04805" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.12837</idno>
		<title level="m">Rethinking the role of demonstrations: What makes in-context learning work?</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">To Extend or not to Extend? Context-specific Corpus Enrichment</title>
		<author>
			<persName><forename type="first">F</forename><surname>Kuhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Braun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Möller</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-35288-2_29</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AI: Advances in Artificial Intelligence</title>
		<meeting>AI: Advances in Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="357" to="368" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Fine-tuning bert models on demand for information systems explained using training data from pre-modern arabic</title>
		<author>
			<persName><forename type="first">T</forename><surname>Asselborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Melzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Aljoumani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Marwitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hirschler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Möller</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3580/paper5.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Humanities-Centred Artificial Intelligence (CHAI 2023)</title>
		<meeting>the Workshop on Humanities-Centred Artificial Intelligence (CHAI 2023)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="38" to="51" />
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Singhal</surname></persName>
		</author>
		<ptr target="http://singhal.info/ieee2001.pdf" />
		<title level="m">Modern information retrieval: A brief overview</title>
		<imprint>
			<date type="published" when="2001">2001. 2024-06-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Gpt-4 technical report</title>
		<author>
			<orgName>OpenAI</orgName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">Gpt-4 technical report</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Augmenting and automating corpus enrichment</title>
		<author>
			<persName><forename type="first">F</forename><surname>Kuhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Braun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Möller</surname></persName>
		</author>
		<idno type="DOI">10.1142/S1793351X20400061</idno>
	</analytic>
	<monogr>
		<title level="j">Int. J. Semantic Computing</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="173" to="197" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
