<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>J. S. Erickson);</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ChatBS: An Exploratory Sandbox for Bridging Large Language Models with the Open Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>John S. Erickson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henrique Santos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jamie McCusker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sola Shirai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Hendler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Submission Type: Demo.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tetherless World Constellation, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY, 12180</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1960</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The recent widespread public availability of generative large language models (LLMs) has drawn much attention from the academic community to run experiments in order to learn more about their strengths and drawbacks. From prompt engineering and fine-tuning to fact-checking and task-solving, researchers have pursued several approaches to try to take advantage of these tools. As some of the most powerful LLMs are “closed” and only accessible through web APIs with prior authorization, combining LLMs with the open web is still a challenge. In this evolving landscape, tools that can facilitate the exploration of the capabilities and limitations of LLMs are desirable, especially when connecting with traditional web features such as search and structured data. This article presents ChatBS, a web-based exploratory sandbox for LLMs, working as a front-end for prompting LLMs with user inputs. It provides features such as entity resolution from open knowledge graphs, web search using LLM outputs, as well as popular prompting techniques (e.g. multiple submissions, “step-by-step”). ChatBS has been extensively used in Rensselaer Polytechnic Institute's Data INCITE courses and research, serving as key tool for utilizing LLMs outputs at scale in these contexts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Generative large language models (LLMs) have become pervasive in recent years [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], permitting
free access to virtually any web user. They are inarguably revolutionary in their ability to
generate human-like text with unprecedented scale and speed. Largely led by the industry,
these LLMs are trained using vast amounts of web-scraped data, and little is known about
exactly what data and how it is used in their training. This proliferation has ignited considerable
interest within the academic community, prompting researchers to conduct experiments aimed
at exploring the strengths and limitations of these powerful tools (e.g. morality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], kbqa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
However, a key challenge arises due to the fact that some of the most formidable LLMs are
considered “closed” and are only accessible through web APIs with prior authorization. This
presents an obstacle to efectively combining content from LLMs with the open web, an issue
that remains a central concern in the evolving landscape of web science. The development of
tools that can facilitate the exploration of the capabilities and limitations of LLMs, especially
when integrated with traditional web features like search and structured data, has become
increasingly desirable.
      </p>
      <p>ChatBS serves as a front-end for prompting LLMs with user inputs, ofering functionalities
to support experimentation and research. On the input side, ChatBS allows users to quickly
incorporate popular prompting techniques, such as making multiple submissions and employing
a “step-by-step” approach. On the output side, ChatBS includes a web search using
LLMgenerated content and entity resolution against open knowledge graphs. We provide an overview
of ChatBS, its functionalities, and its impacts, emphasizing its potential to bridge the gap between
LLMs and the open web.</p>
    </sec>
    <sec id="sec-3">
      <title>2. ChatBS: An Exploratory Sandbox</title>
      <p>ChatBS was conceived as an accessible, extensible platform for interactive experimentation
with LLMs such as, initially, GPT 3.5 and, more recently, GPT 4. It was designed to empower
users to easily conduct sophisticated prompt experiments that might be dificult or impossible
through the normal ChatGPT user interface, and would normally be done by directly accessing
an API (esp. the OpenAI API) programmatically. ChatBS features include:
• LLM Selection: Users can easily select between models from the GTP-4 and GPT-4o
families. In the future, OpenAI and other models will be added or dropped depending on
availability. We plan to diversify model availability by adding openly available models,
such as the recently-released Llama 3.1 model [4], as options.
• System Prompt Customization: To support robust prompt experimentation allowing
users to pre-define the context, scope, guardrails, or output format for the model to
use during an interaction.
• User Prompt Customization.
• Chain-of-Thought Facilitation: Standardized prompt modifications to request models to
explain step-by-step.
• Explanation enhancement: Standardized user prompt append to request references to
support generated content.
• Repeated Prompt Submission: To support scalability in the generation of datasets with
multiple responses to the same standardized prompt.
• Relationship Detection, Entity Resolution, and RDF Generation: ChatBS parses and links
the generated answers to entities in Wikidata, so users can contrast claims.
• Dataset generation: To document experiments and to make the results more shareable
and replicable, both the main ChatBS results and the RDF results (if requested) are
downloadable as JSON(-LD) files.</p>
      <p>ChatBS was initially conceived as a compelling and fun way to fact-check LLM-generated
answers to questions. The earliest versions of ChatBS (2023) demonstrated the ability to identify
asserted relationships between entities within the answers; to resolve entities and relationships
(predicates) to URIs; and to express these result graphs as RDF in JSON-LD format. For each
obtained output, ChatBS employs a relationship discovery and entity linking service that
Submit questions
to OpenAI
Completion API:
Custom "System"
and "User"
prompts
Up to 10 prompt
re-submissions</p>
      <p>Option to append
user prompts for
"chain-of-thought"
reasoning
evaluation</p>
      <p>Embedded
Wikidata entity
summary (using
AutoDesc API) of
selected entity…to
aid in human fact
checking!
Downloadable JSON-LD
of complete results graph</p>
      <p>Each sentence in
results is automatically
linked to a Google
query, proving a quick
"sanity-check"
Downloadable JSON
results file includes
complete prompts
RDF generation is
optional, to facilitate
question answering
parses and analyses the generated content to produce a graph with the LLM’s claims. For this,
we implemented the approach in [5] which involves the use of pre-trained language models
for entity and relationship recognition and naive entity linking to establish connection with
Wikidata.</p>
      <p>ChatBS, as a web application, is entirely accessible through any modern web browser. Its
architecture comprises four basic elements: UI, OpenAI’s continuation API, relationship
discovery and entity-linking, and the Wikidata KG. The UI (Fig. 1) is developed as an R Shiny [6]
app, providing a textual input for users to enter a natural language question about something
(just like a user would do when using ChatGPT) and a parameter to set the desired number
of answers for the same question (this feature was incorporated to stress the potential
diferent answers, sometimes conflicting, LLMs can generate for the very same prompt). Original
responses from the LLM are displayed on the right-hand side panel. Generated triples produced
by the relationship discovery module, composing the graph representing the responses’ claims,
as a table, are optionally displayed by the use of a switch. In addition, a Wikidata snippet
conveniently displays the Wikidata web page for any of the entities or relationships linked by
the entity-linking module by clicking on any of them. This snippet can be leveraged by users to
contrast claims in the triples table against crowd-sourced information in Wikidata, for instance.
ChatBS’s architecture is summarized in Fig. 2.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Impact and Conclusion</title>
      <p>ChatBS has proven helpful in a variety of ways beyond its initial goal as an LLM fact-checker.
As a clean, eficient LLM client optimized for experimentation, it has been especially useful for
the ad hoc replication of LLM-based research results, for evaluating potential research methods,
including prompt engineering strategies, and as a research tool unto itself. ChatBS has been
employed in the classroom to support the reproducibility of published research by students
within Rensselaer Data INCITE [7] (e.g. DecodingTrust [8] and BBQ [9]). In addition, it has been
used to support original publications in diverse domains, such as finance [ 10] and health [11].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The members of the ChatBS team would like to express their gratitude to their Tetherless
World Constellation and Future of Computing Institute colleagues for their ongoing testing and
feedback of ChatBS. We would also like to thank the students of Prof. Kristin Bennett’s “Data
Analytics Research” course (MATP-4910, Fall 2023) who utilized ChatBS in their group projects
in a variety of creative ways.
LLM Family, in: T. R. Payne, V. Presutti, G. Qi, M. Poveda-Villalón, G. Stoilos, L. Hollink,
Z. Kaoudi, G. Cheng, J. Li (Eds.), The Semantic Web – ISWC 2023, Lecture Notes in
Computer Science, Springer Nature Switzerland, Cham, 2023, pp. 348–367. doi:10.1007/
978- 3- 031- 47240- 4_19.
[4] Introducing Llama 3.1: Our most capable models to date, 2024. URL: https://ai.meta.com/
blog/meta-llama-3-1/.
[5] J. McCusker, LOKE: Linked Open Knowledge Extraction for Automated Knowledge Graph</p>
      <p>Construction, 2023. doi:10.48550/arXiv.2311.09366, arXiv:2311.09366 [cs].
[6] W. Chang, J. Cheng, J. Allaire, C. Sievert, B. Schloerke, Y. Xie, J. Allen, J. McPherson,
A. Dipert, B. Borges, shiny: Web Application Framework for R, 2024. URL: https://shiny.
posit.co/, r package version 1.8.1.9000, https://github.com/rstudio/shiny.
[7] The Rensselaer Data Informatics Challenges in Technology Education Pipeline (Data</p>
      <p>INCITE), 2024. URL: https://idea.rpi.edu/research/projects/data-incite.
[8] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, C. Xu, Z. Xiong, R. Dutta, R. Schaefer,
S. T. Truong, S. Arora, M. Mazeika, D. Hendrycks, Z. Lin, Y. Cheng, S. Koyejo, D. Song,
B. Li, DecodingTrust: a comprehensive assessment of trustworthiness in GPT models,
in: Proceedings of the 37th International Conference on Neural Information Processing
Systems, NIPS ’23, Curran Associates Inc., Red Hook, NY, USA, 2024, pp. 31232–31339.
[9] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut,
S. Bowman, BBQ: A hand-built bias benchmark for question answering, in: S. Muresan,
P. Nakov, A. Villavicencio (Eds.), Findings of the Association for Computational Linguistics:
ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 2086–2105.
doi:10.18653/v1/2022.findings- acl.165.
[10] M. Lashuel, G. Kurdistan, A. Green, J. S. Erickson, O. Seneviratne, K. P. Bennett, LLM-Based</p>
      <p>Code Generation for Querying Temporal Tabular Financial Data, In review (2024).
[11] K. Bhanot, J. S. Erickson, K. P. Bennett, MortalityMinder: Visualization and AI
Interpretations of Social Determinants of Premature Mortality in the United States, Information 15
(2024) 254. doi:10.3390/info15050254, number: 5 Publisher: Multidisciplinary Digital
Publishing Institute.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond</article-title>
          ,
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>18</volume>
          (
          <year>2024</year>
          )
          <volume>160</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>160</lpage>
          :
          <fpage>32</fpage>
          . doi:
          <volume>10</volume>
          .1145/3649506.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Borchardt</surname>
          </string-name>
          , S. Gabriel,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <source>Can Machines Learn Morality? The Delphi Experiment</source>
          ,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2110.07574, arXiv:
          <fpage>2110</fpage>
          .07574 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Qi,
          <string-name>
            <surname>Can ChatGPT Replace Traditional KBQA Models</surname>
          </string-name>
          ?
          <article-title>An In-Depth Analysis of the Question Answering Performance of the GPT</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>