<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards the application of case-based reasoning to a system for exploring cultural heritage corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prunelle D. Treuil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, LORIA, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Harold is a conversational system developed to assist the historians working at the Henri Poincaré Archives in exploring the mathematician's correspondence. This correspondence is currently stored in a knowledge base accessible through a SPARQL endpoint, which is difficult to use for anyone unfamiliar with SPARQL. Using Harold, the historians can delimit, through a series of interactions with the system, any set of letters useful to answer the research question they are currently working on. The results of every search are presented as a hierarchy of concepts obtained using formal concept analysis. This allows the user to quickly select the letters or properties that interest them or, conversely, those that they want to remove from the results. Additionally, these concepts can be the basis for constructing an ontology containing both a hierarchy of properties and a hierarchy of concepts. This ontology allows the results to be restructured to show more general concepts. Future work on the system will aim to guide the user through all possible interactions with Harold using case-based reasoning (CBR). This could be done in several different ways: (1) case-based reasoning on traces could be applied to the series of user interactions; (2) since Harold is already a conversational system, conversational case-based reasoning principles could also be used.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology management</kwd>
        <kwd>conversational system</kwd>
        <kwd>formal concept analysis</kwd>
        <kwd>cultural heritage collection</kwd>
        <kwd>case-based reasoning on traces</kwd>
        <kwd>conversational case-based reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper presents the work done to implement Harold, a system for accessing and analyzing a
textual corpus represented by a knowledge graph and structured by an ontology, and shows how
it could be enhanced with the case-based reasoning (CBR) methodology. Harold currently allows for the exploration
of the correspondence of Henri Poincaré (1854-1912), a famous French mathematician. He excelled
in most scientific fields of his time (mathematics, philosophy of science, physics, chemistry, celestial
mechanics, etc.) and was in contact with numerous scientific circles, which makes his correspondence
particularly insightful to study. This correspondence is accessible both on a website and via a SPARQL
endpoint. The endpoint gives direct access to the RDF triples structuring the corpus, that is, the
subject-property-object (or value) triples that describe the knowledge graph. Using the SPARQL endpoint is
difficult for the historians working on the letters, as they know neither how to write SPARQL
queries nor the vocabularies used for the knowledge graph and the ontologies. Harold allows users who are not
SPARQL specialists, such as the historians of the Henri Poincaré Archives, to explore this corpus and to exploit
the knowledge that the graph holds about each letter in order to improve their
understanding of it. With this system, users can create their own domain-focused ontology and use it
to enrich the graph with new concepts and properties that better describe each letter.</p>
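      <p>To illustrate the kind of query that a historian would otherwise have to write by hand, the following minimal sketch turns a selection of properties into a SPARQL query string. It is not Harold's implementation: the namespace, the class name Letter, the property names, and the function build_query are hypothetical placeholders, and values are treated as plain string literals without escaping.</p>
      <preformat><![CDATA[
```python
# Hypothetical namespace: the real vocabulary of the corpus is not shown here.
PREFIX = "PREFIX ex: <http://example.org/correspondence#>"


def build_query(include, exclude=()):
    """Return a SPARQL SELECT query for letters matching a selection.

    include: (property, value) pairs that every returned letter must have.
    exclude: (property, value) pairs that no returned letter may have.
    Property names and the ex:Letter class are illustrative assumptions.
    """
    lines = [PREFIX, "SELECT DISTINCT ?letter WHERE {", "  ?letter a ex:Letter ."]
    for prop, value in include:
        # One triple pattern per required property.
        lines.append(f'  ?letter ex:{prop} "{value}" .')
    for prop, value in exclude:
        # Letters carrying an excluded property are filtered out.
        lines.append(f'  FILTER NOT EXISTS {{ ?letter ex:{prop} "{value}" . }}')
    lines.append("}")
    return "\n".join(lines)


query = build_query(
    include=[("term", "physique")],
    exclude=[("sender", "Henri Poincaré")],
)
print(query)
```
]]></preformat>
      <p>Even for this simple selection, the user would need to know the prefix, the property names, and the FILTER NOT EXISTS construct, which is the barrier that a form-based interface is meant to remove.</p>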
      <p>
        Currently, Harold does not use case-based reasoning, but several ways to incorporate it are considered.
(1) Since Harold is built to foster an iterative search process, the history of all user interactions could be
exploited thanks to a case-based reasoning system to guide the user in their next step. The idea would
be that Harold could give them suggestions such as “add the letters with these properties”, “remove
all the letters from this person”, “regroup these two concepts into a more general concept”, etc. (2) As
Harold is already designed as a conversational system, an approach using conversational case-based
reasoning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] principles could be envisaged. Harold’s users often do not have a clear idea of how to
solve a given research question and need several steps to refine their initial query.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Research Plan</title>
      <sec id="sec-2-1">
        <title>2.1. Research Objectives</title>
        <p>
          Harold’s goal is to accompany a historian during their exploration of a corpus, namely Henri Poincaré’s
correspondence, from their research question to the writing of a scientific publication. For example, a
historian could be interested in the Dreyfus affair, a French political scandal that straddled the 19th
and 20th centuries. After using Harold, they could explain the involvement of Henri Poincaré in the
trial that followed, based on the letters that he exchanged with the main actors in this affair [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. While the
beginning of this thesis focused on the implementation of Harold, the next step of this work should
revolve around the integration of case-based reasoning into the system to improve its usefulness for
the historians of the Henri Poincaré Archives.
        </p>
        <p>Harold provides multimodal access to the corpus, allowing the user to select a precise set of letters
depending on the properties they should or should not have: metadata like the sender or recipient, the
persons quoted, the terms used both in the text and in the annotations, or other data such as formulas,
graphs, and drawings present in the original documents.</p>
        <p>From the set of selected letters, Harold highlights new pieces of information on the corpus and reveals
patterns and potentially interesting concepts that the historian would not necessarily have seen by
themselves. This analysis can be the basis for the construction by the user of an ontology, including
both a hierarchy of properties and a hierarchy of concepts. This ontology in turn improves the precision
and usefulness of the results of every search done on the corpus, in a positive feedback loop that continues until the
user understands their data well enough to answer their initial research question.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Approach / Methodology</title>
        <p>Harold was designed as a conversational system in the sense that it allows iterative
interactions between the user and the system to solve a given problem. Indeed, because the corpus is quite
complex and its documents are often very short, most letters require the expertise of Harold’s
user in order to be correctly interpreted. For example, if someone is trying to find all the letters related
to Henri Poincaré’s work on physics, most of them will not contain the term “physics”. The user will
therefore need to search for additional letters with a new query, or remove some letters or properties from
the results, using their own knowledge. In the case of letters related to physics, a specialist could add
letters exchanged with physicists and physics institutions that they already know to have been in contact
with the mathematician.</p>
        <p>To do so, Harold’s user interface is composed of several parts. First, a form containing one field
for each property describing a document allows the user either to start a new search from the ground
up or to add new documents to the current search. The properties used are the metadata describing the
letters (sender, recipient, writing place, language, and writing date) as well as some textual data: the
persons and groups of persons quoted in the text and some candidate terms extracted from the letters
and from the annotations produced by the historians, called the critical apparatus. These candidate terms
can be either automatically extracted or manually selected by historians. A second part of the interface
shows in a synthetic way all the properties and letters selected by the user and all those that should
be excluded. From this history, SPARQL queries are built, and their results are displayed as a hierarchy
of concepts obtained using formal concept analysis. In this context, a concept is understood as a pair
(set of letters, set of properties) such that the letters are exactly those possessing all the properties and the properties are
exactly those shared by all the letters. This allows the user to quickly discover the main properties shared by the
letters they have selected.</p>
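        <p>The definition of a concept given above can be made concrete with a small sketch. The code below enumerates all formal concepts of a toy context by brute force; the letter identifiers, the property labels, and the function formal_concepts are illustrative assumptions, and this naive enumeration is only suitable for very small examples, not a description of the algorithm used in Harold.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a small context.

    context maps each letter to the set of properties it has.
    Brute force over all subsets of letters: a toy approach only.
    """
    letters = sorted(context)
    all_props = set().union(*context.values())
    concepts = set()
    for size in range(len(letters) + 1):
        for group in combinations(letters, size):
            # Intent: the properties shared by every letter of the group.
            intent = set(all_props)
            for letter in group:
                intent = intent.intersection(context[letter])
            # Extent: every letter that has all the properties of the intent.
            extent = {name for name in letters if intent.issubset(context[name])}
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical toy context: three letters and three properties.
context = {
    "letter_1": {"sender=Poincaré", "term=physique"},
    "letter_2": {"sender=Poincaré", "term=physique", "language=fr"},
    "letter_3": {"language=fr"},
}

concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```
]]></preformat>
        <p>Ordering the resulting pairs by inclusion of their sets of letters gives the hierarchy of concepts: the pair containing all letters sits at the top, and more specific groups of letters with more shared properties appear below it.</p>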
        <p>A specific part of the user interface allows ontology management. As seen in Figure 1, it contains
both a hierarchy of properties and a hierarchy of concepts. These properties and concepts can be
retrieved from the results section or created manually. Once the hierarchies are organized properly,
they can be used to generalize the properties of every letter. For example, if a letter contains in its text
the term “problème de Dirichlet” (in English, “Dirichlet problem”), then it can be inferred that
one of the letter’s topics is related to mathematics. Applying this generalization before triggering the formal
concept analysis allows Harold to find more general concepts. Even if the letters do not contain the
words “mathematics” or “physics”, the user can iteratively create both concepts to find all the letters
related to these topics. Figure 2 shows the impact of the hierarchies on the letters containing
the term “physique”: Figure 2a presents the result of this search without taking the hierarchies into
consideration, and Figure 2b presents the results that take them into account.</p>
        <p>Figure 2: (a) Hierarchical presentation of the results after starting a new search on the letters
containing the term “physique”. (b) The results after taking into account the concept hierarchy of Figure 1.</p>
        <p>Finally, a last part
of the interface allows the user to view in more detail the letters found by their search, together with their
data and a link to the Henri Poincaré Archives’ website, which provides the scan of the letter, its transcription, and
any additional images. This allows the user to study each individual letter in order to answer their
initial research question.</p>
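        <p>The generalization step, in which the properties of a letter are extended with their ancestors in the hierarchy before formal concept analysis is run, could be sketched as follows. The function generalize and the dictionary parent_of are hypothetical names, and the hierarchy is a toy one built around the “problème de Dirichlet” example; the entry for “rayons cathodiques” is an invented illustration.</p>
        <preformat><![CDATA[
```python
def generalize(properties, parent_of):
    """Return the given properties plus all their ancestors.

    parent_of maps a property or concept to its direct parent in the
    user-built hierarchy (a hypothetical, simplified representation).
    """
    result = set(properties)
    for prop in properties:
        current = prop
        while current in parent_of:
            current = parent_of[current]
            if current in result:
                # Already covered (also protects against cycles).
                break
            result.add(current)
    return result


# Toy hierarchy; only the first entry comes from the example in the text.
parent_of = {
    "problème de Dirichlet": "mathematics",
    "rayons cathodiques": "physics",
    "mathematics": "science",
    "physics": "science",
}

letter_terms = {"problème de Dirichlet"}
print(sorted(generalize(letter_terms, parent_of)))
```
]]></preformat>
        <p>Feeding the generalized property sets to formal concept analysis, instead of the raw ones, is what lets letters that never mention “mathematics” be grouped under that concept.</p>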
        <p>
          The system should be able to guide the user through the different possible interactions. Indeed, Harold
offers many different possible interactions, and it is not always apparent which one would produce the
best results. As such, the system should propose some possible next steps that bring the user
closer to answering their research question. Different approaches are possible: similarity search,
using LLMs to propose new additions to the ontology, and so on. An interesting possibility would
also be to use case-based reasoning to find next steps potentially useful to the user. The history of all
the interactions performed by Harold’s users in each of their sessions could give some insight as to
which step the user may want to carry out next. As this system has not yet been developed, no particular
CBR technique has been selected to implement it. Some possibilities could be explored: (1) First, since
Harold is already a conversational system, some principles from works on conversational CBR [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] could
be reused to take advantage of the already existing interactivity of Harold. (2) Another possibility
would be to apply approaches from case-based reasoning on traces [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] to the list of successive interactions a
user had with Harold in order to suggest potential new steps. This can also fall under the scope of
process-oriented CBR [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. (3) With the development of Natural Language Processing (NLP) techniques,
new methods using CBR principles for question answering over knowledge bases can be envisaged [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ].
(4) Finally, textual CBR [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] could be used to improve text mining on the letters in order to better retrieve
their underlying themes.
        </p>
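        <p>As a purely illustrative sketch of option (2), the code below treats each prefix of a past session as a case whose solution is the step that followed it, and retrieves the case most similar to the current session. Since no CBR technique has been selected yet, everything here is an assumption: the action names, the function suggest_next_step, and the use of the sequence ratio from Python's difflib as a stand-in for a real similarity measure.</p>
        <preformat><![CDATA[
```python
from difflib import SequenceMatcher


def suggest_next_step(current, past_traces):
    """Suggest the step that followed the most similar past trace prefix.

    Each case is a (prefix, next_step) pair cut out of a past session.
    Similarity is difflib's sequence ratio, used here only as a placeholder.
    Returns (suggested_step, similarity), or (None, 0.0) if nothing matches.
    """
    best_score, best_step = 0.0, None
    for trace in past_traces:
        for cut in range(1, len(trace)):
            prefix, next_step = trace[:cut], trace[cut]
            score = SequenceMatcher(None, current, prefix).ratio()
            if score > best_score:
                best_score, best_step = score, next_step
    return best_step, best_score


# Hypothetical interaction traces from two earlier sessions.
past_traces = [
    ["search_term", "remove_letter", "create_concept"],
    ["search_sender", "add_property", "remove_property"],
]
current_session = ["search_term", "remove_letter"]

step, score = suggest_next_step(current_session, past_traces)
print(step, score)
```
]]></preformat>
        <p>A real system would need a richer case representation (for instance, the properties and letters involved in each step) and an adaptation phase; this sketch covers retrieval only.</p>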
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Progress Summary</title>
      <p>So far, Harold allows users to explore Henri Poincaré’s correspondence iteratively and interactively.
The ontology management interface has been implemented, and the ontology can be used to create
more general concepts shown in the results. Harold is accessible online for testing on the Henri Poincaré
corpus at https://harold.ahp-numerique.fr/.</p>
      <p>Even though Harold could be applied to other corpora, it was primarily created for the Henri Poincaré
correspondence. As such, the historians of the Henri Poincaré Archives have been involved throughout
its development, guiding its design and functionalities and testing it thoroughly. Several evaluation
sessions have been held with them, and more are planned. Their expertise and time will be useful for any
implementation of a CBR system in Harold.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>As Harold is now online, the historians working on the Henri Poincaré corpus have started to test the
system. Harold will also be used on several other corpora, including (a) the correspondence
of another scientist and (b) a set of scientific papers on a given domain, with the purpose of helping to
build a state of the art of this domain.</p>
      <p>To expand the multimodal access to the corpus provided by Harold, and since the corpus used with it
is mainly scientific, an interesting next step would be to use the formulas present in the documents.
The formulas contained in the Poincaré corpus are currently encoded in LaTeX. The idea would be to
automatically link a formula with some concept of the ontology. For example, a ∇ or ∂ would indicate
a differential equation, a ∫ an integral, etc. This would make it possible to extract more information from the
letters.</p>
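      <p>A minimal sketch of this idea is given below, assuming that the link is made through a simple lookup from LaTeX commands to concept labels. The mapping COMMAND_TO_CONCEPT, the concept labels, and the function concepts_in_formula are hypothetical; the mapping only encodes the heuristic stated above and is far from a reliable classifier of formulas.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical mapping from LaTeX commands to ontology concept labels.
COMMAND_TO_CONCEPT = {
    r"\nabla": "differential equation",
    r"\int": "integral",
}


def concepts_in_formula(latex):
    """Return the concept labels suggested by the commands in a formula."""
    # A LaTeX command is a backslash followed by one or more letters.
    commands = set(re.findall(r"\\[A-Za-z]+", latex))
    return {COMMAND_TO_CONCEPT[c] for c in commands if c in COMMAND_TO_CONCEPT}


print(concepts_in_formula(r"\int_0^1 f(x)\,dx"))
print(concepts_in_formula(r"\nabla^2 u = 0"))
```
]]></preformat>
      <p>The concepts returned for each formula could then be added to the properties of the corresponding letter, so that they take part in the formal concept analysis like any other property.</p>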
      <p>Finally, the question of how to guide the user through all the different interaction possibilities in the
interface is still open. The idea is that the historian has a research question to answer, and the system
should suggest an efficient next step towards this goal. As mentioned above, at least two CBR
approaches can be considered: applying trace-based CBR to the user interaction history in
order to propose relevant new interactions to the historian, or developing the conversational aspects
of Harold to select both the letters needed to answer the user’s research question and the ontology
used to structure them. Other CBR approaches could probably also be used to guide the user in their
interactions with Harold, such as textual CBR, process-oriented CBR, or CBR for question answering over
knowledge bases.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The author thanks the reviewers for their insightful suggestions and the papers they recommended,
which opened up new perspectives.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used Writefull for the purpose of: Grammar and spelling
check, Paraphrase and reword. After using this tool, the author reviewed and edited the content as
needed and takes full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Conversational CBR</article-title>
          ,
          <source>Case-Based Reasoning: A Textbook</source>
          (
          <year>2013</year>
          )
          <fpage>465</fpage>
          -
          <lpage>485</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rollet</surname>
          </string-name>
          ,
          <article-title>Autour de l'affaire Dreyfus. Henri Poincaré et l'action politique</article-title>
          ,
          <source>Revue Historique</source>
          <volume>298</volume>
          (
          <year>1997</year>
          )
          <fpage>49</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Aha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Breslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Muñoz-Avila</surname>
          </string-name>
          ,
          <article-title>Conversational Case-Based Reasoning</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>14</volume>
          (
          <year>2001</year>
          )
          <fpage>9</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cordier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lefevre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Champin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Georgeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mille</surname>
          </string-name>
          ,
          <article-title>Trace-Based Reasoning - Modeling Interaction Traces for Reasoning on Experiences</article-title>
          ,
          <source>Twenty-Sixth International Florida Artificial Intelligence Research Society Conference</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Recio-García</surname>
          </string-name>
          ,
          <article-title>Process-oriented case-based reasoning</article-title>
          ,
          <source>Information Systems</source>
          <volume>40</volume>
          (
          <year>2014</year>
          )
          <fpage>103</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Godbole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tower</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Knowledge Base Question Answering by Case-based Reasoning over Subgraphs</article-title>
          , in: K. Chaudhuri,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvari</surname>
          </string-name>
          , G. Niu, S. Sabato (Eds.),
          <source>Proceedings of the 39th International Conference on Machine Learning</source>
          , volume
          <volume>162</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR
          ,
          <year>2022</year>
          , pp.
          <fpage>4777</fpage>
          -
          <lpage>4793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>GS-CBR-KBQA: Graph-structured case-based reasoning for knowledge base question answering</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>257</volume>
          (
          <year>2024</year>
          )
          <fpage>125090</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. O.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Ashley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brüninghaus</surname>
          </string-name>
          ,
          <article-title>Textual case-based reasoning</article-title>
          ,
          <source>The Knowledge Engineering Review</source>
          <volume>20</volume>
          (
          <year>2005</year>
          )
          <fpage>255</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cordier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lefevre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Champin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mille</surname>
          </string-name>
          ,
          <article-title>Modéliser les traces d'interaction pour raisonner à partir de l'expérience tracée ?</article-title>
          , in:
          <source>IC - 24èmes Journées francophones d'Ingénierie des Connaissances</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>