<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Special Session on Harmonising Generative AI and Semantic Web Technologies, November</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automated Clustering and Curriculum Learning Guided by Human Training</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jamie McCusker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henrique Santos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rishi Singh</string-name>
          <email>rishi.singh@bostonfusion.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabbir M. Rashid</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Sanders</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grace Roessling</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongji Guo</string-name>
          <email>guoh11@rpi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bashirul Biswas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomek Strzalkowski</string-name>
          <email>tomek@rpi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiang Ji</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jay Miller</string-name>
          <email>jay.miller@bostonfusion.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, AI Fusion, Knowledge Graphs, Curriculum Learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Boston Fusion Corp</institution>
          ,
          <addr-line>Lexington, MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>13</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>ARCLIGHT is an AI fusion system that leverages Large Language Models, perception learning, knowledge graphs, and human guidance to describe high-level concept instances with lower-level attributes and affordances. By combining structured models and unsupervised exploration, ARCLIGHT discovers attributes and affordances in both known and unknown objects, entities, or activities. This enables automated novelty detection, curation of a symbolic knowledge graph, and a dialogue agent that asks discriminating questions. The system's perception component utilizes Bayesian models to recognize unknown and novel concepts, flag regions of high epistemic uncertainty, and update the knowledge graph based on user interactions. ARCLIGHT can potentially improve human-machine collaboration and advance artificial intelligence in various fields.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>AI Fusion</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Curriculum Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Automated Clustering and Curriculum Learning Guided by Human Training (ARCLIGHT) system is
able to learn high-level concept instances (objects, entities, activities) with lower-level concept instances
of attributes (features) and affordances (capabilities). ARCLIGHT discovers attributes and affordances in
both known and unknown objects, entities, and activities (see Figure 1 for the knowledge representation)
through a combination of structured models and unsupervised exploration. The structured models
allow for automated novelty detection (e.g., this entity appears to be a dinosaur, but it is purple) while
the unsupervised attribute exploration (through masking for unknown objects and more nuanced
saliency maps and uncertainty attribution for known objects) allows discovery of defining features (e.g.,
cars have wheels). Both approaches allow curation of a symbolic knowledge graph (adding attributes
to objects) as well as enabling a Large Language Model-based dialogue agent to ask discriminating
questions. Our approach is to build on and extend the Whyis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] knowledge graph framework to include
multi-modal neural perception and dialogue systems. By utilizing Bayesian models in the perception
system, unknown concepts can be recognized and regions of high epistemic uncertainty can be flagged.
Novelty or uncertainty can be flagged at the attribute, concept, relationship, or scene level. The dialogue
system can utilize regions of high certainty or uncertainty to highlight features for a user, can query
the knowledge graph to reason about an unknown object, and can update the knowledge graph directly
based on user interactions.
      </p>
      <fig id="fig-1">
        <caption>
          <p>Figure 1: The ARCLIGHT knowledge representation, relating Object, Entity, Attribute, Time Interval, Agent, and activity (wd:activity) concepts via properties such as subclass of, has part, has participant, has input, has target, has output, has product, has attribute, has agent, and exists at, aligned with schema.org media terms (schema:VideoObject, schema:Clip, schema:ImageSegment, schema:hasPart, schema:about).</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-2">
      <title>2. System Overview</title>
      <p>
        ARCLIGHT is implemented as a multiagent system using the Whyis knowledge graph framework’s
autonomous inference architecture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This means that its components are implemented as agents
– programs that receive knowledge graph (KG) fragments as inputs, process them, and produce KG
outputs that might in turn become inputs to other agents. Such an approach enables the different
components to asynchronously interact with one another and scale as needed by allocating agents
to free computing resources as they become available. The ARCLIGHT multiagent environment is
implemented as a collection of agents that expand and create a common knowledge graph. Agents’
actions in this environment create changes in the knowledge graph that other agents might react to.
This produces a chain of input-output relationships that could, for example, represent loosely coupled
media processing pipelines. In the technical descriptions that follow, it is useful to think of the resulting
coordination dynamics between agents as a publish-subscribe system where interaction among agents
is mediated by topics agents publish and subscribe to, even though Whyis does not explicitly implement
that type of coordination pattern. Information and knowledge are stored within a series of databases.
Within Whyis, the Fuseki RDF knowledge graph database (triplestore, https://jena.apache.org/documentation/fuseki2/) stores symbolic knowledge and
the Milvus embedding vector database [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] stores latent representations of concepts within the system.
A multimedia repository is mapped to specific IRIs within the database so that media metadata is part
of the KG. A runtime-configured collection of agents then watches the KG for changes to analyze and
produces subsequent KG fragments.
      </p>
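      <p>To make the agent abstraction concrete, the following sketch shows an agent that watches for newly ingested video and emits a KG fragment marking it as analyzed. This is a minimal illustration: the class shape, method names, and arclight: namespace are our own placeholders, not the actual Whyis agent API.</p>
      <preformat>
from rdflib import Graph, Literal, Namespace

ARCLIGHT = Namespace("http://example.org/arclight/")  # hypothetical namespace


class VideoIngestAgent:
    """Illustrative skeleton; names are placeholders, not the Whyis API."""

    # Graph pattern this agent effectively "subscribes" to; the schema:
    # and arclight: prefixes are assumed to be registered in the store.
    query = """
        SELECT ?video WHERE {
            ?video a schema:VideoObject .
            FILTER NOT EXISTS { ?video arclight:analyzed true }
        }
    """

    def process(self, video_node, output: Graph):
        # Analyze the matched node and emit a new KG fragment; any agent
        # whose pattern matches the emitted triples fires in turn.
        output.add((video_node, ARCLIGHT.analyzed, Literal(True)))
</preformat>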
      <p>In Figure 2, we show a high-level diagram of the different services and types of agents in the system
interacting via what is in effect a publish-subscribe model. Since each agent is looking for specific graph
patterns using SPARQL queries, the agents are modular. As a result, multiple agents can be prototyped,
evaluated, and deployed in parallel as desired without disrupting the overall system.</p>
      <sec id="sec-2-1">
        <title>2.1. Media Analysis Agents</title>
        <p>ARCLIGHT uses agents for each perceptual task, including classification and simultaneous
uncertainty quantification for objects, activities, attributes, and affordances in incoming multi-modal
data. Additionally, insight agents perform post-hoc analyses on the classification agents and their
output to provide saliency maps and uncertainty attribution maps and to decompose known and unknown
objects, which enables efficient resolution of novelty from oracles and stored knowledge.</p>
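      <p>For illustration, an insight agent of this kind might watch for detections that do not yet have a saliency map using a trigger pattern like the hypothetical one below; because each agent declares its own pattern, such an agent can be added or removed without touching the rest of the system.</p>
      <preformat>
# Hypothetical trigger pattern for a saliency-map insight agent; the
# arclight: terms are placeholders, not the published ontology, and
# prefixes are assumed to be registered in the store.
SALIENCY_TRIGGER = """
    SELECT ?detection ?image WHERE {
        ?detection a arclight:Detection ;
                   prov:wasDerivedFrom ?image .
        FILTER NOT EXISTS { ?detection arclight:saliencyMap ?map }
    }
"""
</preformat>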
      <p>
        The perception agents adopt an open-world detection framework, which separates the task of entity
localization (left side of Figure 2) from classification (right side of Figure 2), enabling the detection of
unknown objects. For localization in the full system, the perception agents will use a common region
proposal network (RPN) combining a baseline RPN similar to He et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with the Segment Anything
Model (SAM). SAM is trained on billions of objects across a wide domain and is thus a powerful, largely
class-agnostic tool which can greatly improve the RPN’s representation of “object-ness”. Currently,
individual agents use a separate RPN as part of their architecture to allow better evaluation of state-of-the-art (SOTA)
open-world detection methods. Insight agents perform analyses on perception agents themselves to
probe their models, localize sources of uncertainty, and detect emergent features.
      </p>
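      <p>As an illustration, SAM's automatic mask generator can be used directly as a class-agnostic proposal source; the checkpoint path below is illustrative, and fusing the resulting boxes with the baseline RPN's proposals is left schematic.</p>
      <preformat>
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Illustrative checkpoint path; combining these proposals with a
# baseline RPN (e.g., via score-weighted NMS) is not shown here.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected mask
proposals = [(m["bbox"], m["stability_score"]) for m in masks]  # XYWH boxes
</preformat>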
      <p>
        The design of the Object Classification Agent focuses on evaluating the use of uncertainty
quantification and attribution for novel object detection in a SOTA deformable detection transformer model
architecture, similar to work from Zohar et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for the open world object detection (OWOD) setting.
This architecture uses a common encoder-decoder feeding both a feature extraction network and an
RPN, which allows it to detect the presence of unknown objects.
      </p>
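      <p>The exact uncertainty estimator is not prescribed by this design; one common choice that fits the Bayesian treatment is the mutual-information decomposition over stochastic forward passes (e.g., MC dropout or a deep ensemble), sketched below under those assumptions.</p>
      <preformat>
import torch

def epistemic_uncertainty(logits_samples: torch.Tensor) -> torch.Tensor:
    """BALD-style mutual information from S stochastic forward passes.

    logits_samples: [S, N, C] class logits for N candidate detections,
    e.g., from MC dropout; this estimator is our illustrative choice,
    not necessarily the one used in ARCLIGHT.
    """
    probs = torch.softmax(logits_samples, dim=-1)                       # [S, N, C]
    mean_p = probs.mean(dim=0)                                          # [N, C]
    total = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)           # H[E[p]]
    expected = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)  # E[H[p]]
    return total - expected  # high values flag epistemic (novelty) regions
</preformat>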
      <p>
        The Attribute Classification Agent employs feature extractors based on the model and dataset
presented by Ramanathan et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This includes separate attribute heads, one per predetermined attribute
type (color, pattern-marking, material, and reflectance), downstream of a Mask R-CNN architecture; the
heads are initially trained on the PACO dataset.
      </p>
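      <p>Schematically, the per-type attribute heads all read the same pooled RoI features; the head sizes and feature dimension in this sketch are placeholders rather than the published PACO configuration.</p>
      <preformat>
import torch.nn as nn

# Hypothetical vocabulary sizes; the actual PACO attribute taxonomies
# and the feature dimension of the Mask R-CNN backbone differ.
ATTRIBUTE_TYPES = {"color": 30, "pattern_marking": 14, "material": 14, "reflectance": 3}

class AttributeHeads(nn.Module):
    """One linear head per predetermined attribute type, all fed by the
    same Mask R-CNN RoI feature vector."""

    def __init__(self, feat_dim=1024, types=ATTRIBUTE_TYPES):
        super().__init__()
        self.heads = nn.ModuleDict({t: nn.Linear(feat_dim, n) for t, n in types.items()})

    def forward(self, roi_features):
        # roi_features: [N, feat_dim] pooled features for N detections
        return {t: head(roi_features) for t, head in self.heads.items()}
</preformat>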
      <p>
        Interaction between ARCLIGHT and human users (e.g., analysts, teachers, or evaluators) happens
through a multimodal, instruction-tuned, tool-aware dialogue agent. The dialogue agent is responsible
for initiating conversation with users to facilitate active learning when uncertainty arises from the
system’s perception. It is also responsible for responding to user-initiated conversation to aid in typical
analytic tasks including question answering and reasoning over the media presented by the user. All
discussion is mediated using the Activity Streams [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] vocabulary, so that UIs and dialogue agents alike
read and create “Fediverse”-compliant messages and media posts.
      </p>
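      <p>For example, a system-initiated question about an uncertain detection can be carried as an ordinary Activity Streams 2.0 activity; only the @context and the type and property names below come from the vocabulary, while the IRIs are placeholders.</p>
      <preformat>
# A hypothetical system-initiated message in Activity Streams 2.0 form.
question = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": "http://example.org/arclight/dialogue-agent",
    "object": {
        "type": "Note",
        "content": "This entity appears to be a dinosaur, but it is purple. "
                   "Is it a known object?",
        "attachment": {"type": "Image", "url": "http://example.org/media/frame-42.jpg"},
    },
}
</preformat>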
      <p>
        At the core of the dialogue system is the instruction-tuned Large Language Model (LLM) Llama 3
70B [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. At inference time, the LLM is prompted with instructions pertaining to: (1) its purpose, (2)
operational constraints such as which images the user is viewing, (3) a listing of all tools (e.g., APIs,
databases) at its disposal, (4) contextual knowledge relevant to the conversation, and (5) the current
dialogue history. The LLM then proceeds following a ReAct-like reason-action loop [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to retrieve any
further necessary information (e.g., via a SPARQL query) and respond to the user. Figure 3 illustrates
this process in our user interface, where the agent is able to retrieve relevant instances from the graph
through text dialogue.
      </p>
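      <p>A stripped-down version of this loop is sketched below; llm() and run_sparql() are placeholders standing in for the Llama 3 70B completion call and the Whyis SPARQL endpoint, and the Thought/Action/Observation text format is one common ReAct convention rather than our exact prompt.</p>
      <preformat>
# Minimal ReAct-style reason-action loop under the stated assumptions.
def dialogue_turn(llm, run_sparql, system_prompt, history, user_msg, max_steps=8):
    transcript = history + [f"User: {user_msg}"]
    for _ in range(max_steps):
        step = llm(system_prompt, "\n".join(transcript))  # one reasoning step
        transcript.append(step)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if "Action: sparql" in step and "Action Input:" in step:
            query = step.split("Action Input:", 1)[1].strip()
            transcript.append("Observation: " + str(run_sparql(query)))
    return "I could not complete the request within the step budget."
</preformat>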
      <p>When an agent initiates a conversation with a user following the ingestion of media with high
uncertainty, the ARCLIGHT dialogue agent selects actions that update Whyis with new knowledge
obtained from the interaction. The LLM is responsible for locating the appropriate labels, descriptions,
and other relevant information and crafting the call to a Whyis API to update the ontology accordingly.</p>
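      <p>A sketch of such a write-back follows, assuming a nanopublication-style HTTP endpoint; the /pub path, deployment URL, and payload shape are our assumptions for illustration, not the documented Whyis API.</p>
      <preformat>
import requests
from rdflib import Graph, Literal, URIRef, RDFS

WHYIS_BASE = "http://localhost:5000"  # hypothetical deployment URL

# Write back a user-confirmed label for an uncertain detection.
g = Graph()
detection = URIRef("http://example.org/media/frame-42#det-7")
g.add((detection, RDFS.label, Literal("purple dinosaur costume")))

resp = requests.post(WHYIS_BASE + "/pub",              # assumed endpoint
                     data=g.serialize(format="turtle"),
                     headers={"Content-Type": "text/turtle"})
resp.raise_for_status()
</preformat>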
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>The ARCLIGHT system is a knowledge graph-centric AI fusion system that allows users to easily
upload media that can be analyzed by a suite of perception, classification, and dialogue agents, creating
knowledge graph fragments to describe each depicted scene. This graph-augmented knowledge is then
used to drive discussion through an instruction-tuned LLM that knows how to query the graph and
extract relevant knowledge to provide suitable responses to user dialogue. Further, the media, dialogue,
and image knowledge are all represented within the knowledge graph, allowing for comprehensive
explanations of any analysis and tracing of any given source of information. Our hope is that this kind
of modular system can serve as a model for neuro-symbolic learning using perception, language, and
knowledge in a meaningful way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCusker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <article-title>Whyis 2: An Open Source Framework for Knowledge Graph Development and Research</article-title>
          , in:
          <source>The Semantic Web, Lecture Notes in Computer Science</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>538</fpage>
          -
          <lpage>554</lpage>
          . doi:10.1007/978-3-031-33455-9_32.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          , et al.,
          <article-title>Milvus: A purpose-built vector data management system</article-title>
          , in:
          <source>Proceedings of the 2021 International Conference on Management of Data</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2614</fpage>
          -
          <lpage>2627</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCusker</surname>
          </string-name>
          ,
          <article-title>Customizable knowledge graph visualization using the Whyis knowledge explorer</article-title>
          , in:
          <source>Visualization and Interaction for Ontologies, Linked Data and Knowledge Graphs</source>
          , CEUR Workshop Proceedings,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kortylewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <article-title>Partimagenet: A large, high-quality dataset of parts</article-title>
          , in:
          <source>European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zohar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <article-title>PROB: Probabilistic objectness for open world object detection</article-title>
          , in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11444</fpage>
          -
          <lpage>11453</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Petrovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kovvuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          , et al.,
          <article-title>PACO: Parts and attributes of common objects</article-title>
          , in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>7141</fpage>
          -
          <lpage>7151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prodromou</surname>
          </string-name>
          ,
          <article-title>Activity Vocabulary</article-title>
          ,
          <source>W3C Recommendation, W3C</source>
          ,
          <year>2017</year>
          . https://www.w3.org/TR/2017/REC-activitystreams-vocabulary-20170523/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>ReAct: Synergizing reasoning and acting in language models</article-title>
          , in:
          <source>The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=WE_vluYUL-X.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>