<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Modeling the Hebrew Bible: Potential of Topic Modeling Techniques for Semantic Annotation and Historical Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mathias Coeckelbergs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seth van Hooland</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Université Libre de Bruxelles, Département des Sciences de l'information et de la communication, Avenue F. D. Roosevelt</institution>
          ,
          <addr-line>50 CP 123 B-1050 Bruxelles</addr-line>
          ,
          <country country="BE">Belgique</country>
        </aff>
      </contrib-group>
      <fpage>47</fpage>
      <lpage>52</lpage>
      <abstract>
<p>Providing useful and efficient semantic annotations is a major challenge for knowledge design of any body of text, especially historical documents. In this article, we propose Topic Modeling as an important first step to gather semantic information beyond the lexicon, which can be added as annotations in the SHEBANQ. By laying out a case study, we discuss both the noise and the structure found when comparing topics extracted within different distributions, and show the value of such an approach, which we label a topic hierarchy. We also show a first result of applying this approach to study diachronic variety in the Bible, and show how this overall Topic Modeling approach can result in more query options for users of the database.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Keywords</title>
      <p>Historical Ontology, Natural Language Processing, Semantic Annotation, SHEBANQ, Topic Modeling</p>
    </sec>
    <sec id="sec-2">
<title>Introduction</title>
<p>
        When it comes to the digital study of ancient texts, the Perseus Digital Library is probably one of the most well-known projects. Other projects focus on the annotation of texts with linguistic tags allowing in-depth queries into the surface structure of the text, such as for example the Bibleworks 10 software, which contains lexical, morphological and accentual tags. Syntactic tags are not available there, but can be found for example in Accordance and Logos, as well as in SHEBANQ. This recent database, developed by the Eep Talstra Centre for Bible and Computer, contains the corpus and the metadata which are the focus of the work presented in this paper. These projects have since their inception seen an increasing uptake in the scholarly community, albeit mainly for close reading practices. This term refers to the traditional humanistic method of reading and studying in detail a specific part of a corpus, which generally stems from a well-delimited academic canon. This article is to be placed in the general project of moving towards distant reading methods for the study of the Hebrew Bible, as developed by Moretti [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Projects such as Perseus and SHEBANQ facilitate detailed reading and interpretation of texts due to the ease of access to the documents and the hyperlinks to grammars, dictionaries and concordances. However, recent developments from the Semantic Web and the NLP communities hold the promise of going well beyond a mere presentation of texts and hyperlinks for human consumption.
      </p>
      <sec id="sec-2-1">
        <title>Research corpus: SHEBANQ</title>
<p>
          The System for HEBrew text: ANnotations for Queries and Mark-up (SHEBANQ) constitutes our core corpus [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This is an online platform allowing a systematic study of the Hebrew Bible, centered around the idea of user annotations, which enrich the text with a multitude of additional information, most notably queries. Until now, interest has mainly focused on expanding the syntactic descriptive granularity of the text. The long-term goal of the project is to create an ever-expanding platform including a critical apparatus of metadata annotations, which need not be limited to syntactic data. The logical next step is to explore possibilities of adding semantic information surpassing the lexical level. For the given corpus, this is a daunting task due to the enormous variation of texts extant in the Hebrew Bible, which hampers simple application of current semantic-level text mining techniques such as Named Entity Recognition (NER) and Topic Modeling (TM).
        </p>
<p>As already stated, the main databases apart from the SHEBANQ are the Logos Bible Software, Accordance and Bibleworks. Together they constitute a useful digital source of the main Hebrew texts, including non-biblical ones, concerning morphology, lexicon and syntax. These texts can easily be exported to other document formats such as .doc or .txt, and a general Unicode version of the texts can be found on the site of the German Bible Society. Concerning the semantics of the Biblical text, important work is being done by Reinier De Blois in the Semantic Dictionary of Biblical Hebrew. This project does not focus on the historical development of word meaning, a task taken up by the Historical Dictionary Project at the Hebrew University of Jerusalem. Scholarly literature on these issues is gathered by the Semantics of Ancient Hebrew Database.</p>
<p>
          An important improvement of the SHEBANQ in comparison to other Hebrew semantic resources is its novel way of representing data. For this, it uses the Linguistic Annotation Framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Its main asset for our purposes is the possibility of stand-off markup, whereby the primary text is left untouched and annotations are added as feature structures. The whole file then is structured as a graph, consisting of nodes and edges. Nodes are assigned to every individual constituent of the file, both in the primary text as well as in the annotations. Edges are used to connect nodes into an ever larger picture, linking relationships between constituents and entities.
        </p>
        <p>The tools and resources mentioned in this paper are available online: BibleWorks (http://www.bibleworks.com), Accordance (http://www.accordancebible.com), Logos (https://www.logos.com), SHEBANQ (https://shebanq.ancient-data.org), the German Bible Society (https://www.dbg.de/), the Semantic Dictionary of Biblical Hebrew (http://www.sdbh.org/), the Historical Dictionary Project (http://hebrew-academy.huji.ac.il/English/HistoricalDictionaryProject/), and the Semantics of Ancient Hebrew Database (http://www.sahd.div.ed.ac.uk/).</p>
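The stand-off principle described above can be made concrete with a small sketch. The data model below is a deliberately simplified illustration of the idea, not the actual LAF or SHEBANQ format; the identifiers, features and the toy sentence are our own. The primary text is never edited: annotations live in separate nodes, and edges connect them to token nodes identified by character offsets.

```python
# Simplified sketch of stand-off markup as a graph of nodes and edges.
# The primary text stays untouched; everything else points into it.
primary_text = "In the beginning God created the heaven and the earth"

# Token nodes: one per constituent of the primary text, located by offsets.
nodes, pos = {}, 0
for i, word in enumerate(primary_text.split()):
    start = primary_text.index(word, pos)
    nodes[f"t{i}"] = {"start": start, "end": start + len(word), "form": word}
    pos = start + len(word)

# Annotation nodes carry feature structures (hypothetical features here);
# edges attach each annotation to the token node it describes.
annotations = {"a0": {"features": {"pos": "noun", "topic": "creation"}}}
edges = [("a0", "t3")]  # annotation a0 describes token t3 ("God")

# Resolving an annotation back to text only follows the graph:
span = nodes[edges[0][1]]
assert primary_text[span["start"]:span["end"]] == "God"
```

The same mechanism scales up: a query result or a topic assignment is just another annotation node with edges into the token layer.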
      </sec>
    </sec>
    <sec id="sec-3">
<title>Integration of the Hebrew Bible within the Semantic Web Community</title>
      <sec id="sec-3-1">
        <title>General Project and the Role of Topic Modeling</title>
<p>
          The Topic Modeling approach goes beyond both the level of surface forms and problems such as spelling differences, to a more abstract level of semantic entities that cannot be readily found by straightforward textual search. Building salient models for a given corpus is not an easy task, as hallmarked for example by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. A primary task will consist in finding adequate models for each of the Biblical books, which diverge enormously regarding length, content and style. Hence, we will have to study the importance of the number of topics per book, manual and automatic assignment of topics, the probability of co-occurrence of terms in topics of varying length, and other questions regarding the basic architecture of our model.
        </p>
<p>Once we have gathered the necessary information on the topic architecture, we can proceed to annotating these data in the SHEBANQ. Several possible ways exist to proceed in this task. One viable way consists of annotating as metadata the confidence with which a word or sentence belongs to a certain topic, and working out how topic assignment on a higher level (a paragraph or an entire book) can proceed from this information. Another possibility is to focus on the other words that receive a high probability of co-occurring with a given word in a certain topic. Given a word, related words can be annotated, so that discourse structures can be evaluated on the significance with which they represent a topic.</p>
<p>The purpose of annotating topics overlaps with that of a concordance, namely to have a concrete idea of the contexts in which a certain word appears. The first digital concordance was famously created by Father Roberto Busa S.J., with his monumental work on the Corpus Thomisticum. This is a searchable database of all words written by Thomas Aquinas and related authors, constituting a resource of about 11 million words. Classical concordances, whether on paper or digital, present the context and use of words by listing an (exhaustive) list of sentences representing the semantic range of a concrete word. Our annotated topics will not place the word in reference to concrete sentences, but to a probability with which it co-occurs with other words within several topic distributions.</p>
<p>
          This addition of TM to the database should result in the improvement of readily available tasks, such as query expansion. With this step, we wish to contribute to the analysis of TM as a method for data discovery, which has received both positive and negative critiques. Positively, it allows for new ways of comparing and discussing texts; negatively, these possibilities may be overestimated, leading to misleading conclusions. Problems lie in interpreting clustered words as constituting a coherent topic, whereas this is not necessarily always so clear-cut. With our topic hierarchy method, described below, we wish to address this issue in a novel way. Continuing with this approach, we also want to outline how we can use this extracted information to automatically create authority data which can be used to link the data in our database to other, related information. This can be contained in a variety of resources, including most notably research articles, online fora, blogs and social media. The JSTOR Labs API (https://labs.jstor.org/apis/docs/) gives us access to a fuzzy text matching algorithm for linking secondary literature citations of Shakespeare to the original text. A similar approach is also possible for the Hebrew Bible, which can be linked to our topic hierarchy. Our main purpose will be to allow users of the Bible the highest possible level of querying freedom, in order to contribute to a digital critical edition of the Bible which uses current text mining tools to their fullest potential. In the end, the TM architecture described in the next section constitutes a firm basis as pre-training before applying NER to the corpus, as done recently in for example [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This is an important asset for a contribution to this recent field of inquiry.
        </p>
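The fuzzy-matching idea behind linking loose quotations to a source text can be sketched generically. The snippet below uses Python's standard-library difflib, not the actual JSTOR Labs algorithm, and the verse texts and quotation are illustrative.

```python
import difflib

# Generic fuzzy-matching sketch: locate which verse a loose
# secondary-literature quotation most likely cites.
verses = {
    "GEN 1:1": "In the beginning God created the heaven and the earth",
    "GEN 1:3": "And God said, Let there be light: and there was light",
}
quotation = "in the beginning, God created heaven and earth"

def best_match(quote, candidates):
    """Score each candidate verse by character-level similarity to the
    quotation and return the reference of the closest one."""
    scored = {
        ref: difflib.SequenceMatcher(None, quote.lower(), text.lower()).ratio()
        for ref, text in candidates.items()
    }
    return max(scored, key=scored.get)
```

A production system would of course match against all verses and threshold the similarity score before asserting a link.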
      </sec>
      <sec id="sec-3-2">
        <title>Proof of Concept</title>
<p>In this subsection we describe our first attempts at building a topic hierarchy, by extracting different topic distributions from the Hebrew Bible and comparing their inherent relationship and usefulness. As a training set we used the book of Genesis, a relatively large book which incorporates both narrative and poetic parts, to test the LDA algorithm on the other books of the Hebrew Bible. The study of model making on the basis of ancient Hebrew literature, and more specifically the importance of adapting training and test data to each other, is still in need of further elaboration, and our results may be skewed because of our particular choice. For the execution of this algorithm, we rely on the Mallet software (http://mallet.cs.umass.edu/) for its ease of adaptation of feature values. Of course, further in the future other software and algorithms will have to be subject of investigation as well. Tests comprised the creation of several topic models with varying numbers of topics (a so-called distribution) to be found in the text. In accordance with our expectations, the fewer topics that needed to be found in the text, the less coherent the topics seem to be. As a topic model is nothing more than a group of words which are probabilistically clustered together, human interpretation of these clusters is a problem in itself, an issue which is out of the scope of the present article.</p>
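The experiment just described can be run from the Mallet command line roughly as follows. This is a sketch under assumptions: the input directory, output file names and option choices are ours for illustration, not the exact settings used in this study.

```shell
# Import the corpus once, then train one model per topic count
# ("distribution") from 30 to 90 in increments of 10.
# Paths and file names are illustrative.
bin/mallet import-dir --input genesis_verses/ \
  --output genesis.mallet --keep-sequence

for k in 30 40 50 60 70 80 90; do
  bin/mallet train-topics --input genesis.mallet \
    --num-topics "$k" \
    --output-topic-keys "keys_${k}.txt" \
    --output-doc-topics "doctopics_${k}.txt"
done
```

The `--output-topic-keys` files list each topic's top words, which is the input needed for the cross-distribution comparison below; `--output-doc-topics` gives the per-document topic proportions usable as confidence annotations.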
<p>Looking at our first results comparing different topic distributions (going from 30 to 90 topics with increments of 10), we notice both chaos and interesting structure. (We would like to thank Dirk Roorda for his help with visualising topic hierarchies.) The more words two topics across distributions have in common, the darker the line connecting them is coloured. In Figure 1, we have filtered out lines connecting only a few words in order to preserve clarity. We see that some topical structures are apparent in all distributions, sharing a remarkable amount of words. This is for example the case for topics concerning kingship. Common words for this topic include the Hebrew words for king (mlk), conjugated forms of the verb to rule (mlk) and throne (ks'). Small distributions also include different named lands, such as Edom and Israel, which appear to be filtered out in higher ones. This can be explained because low-level distributions still contain clustered words with a relatively large distance to the cluster centre.</p>
        <p>Figure 1: Most important similarities of topic distributions ranging from 30 to 90 topics with increments of 10.</p>
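The linking step behind Figure 1 can be sketched in a few lines: two topics from different distributions are connected with a weight equal to the number of top words they share, and weak links are filtered out. The topic identifiers and word sets below are illustrative (transliterated), not our actual model output.

```python
# Sketch of the topic-hierarchy linking step. Each "distribution" maps
# topic ids to their sets of top words (illustrative data).
dist_30 = {
    "T30-4": {"mlk", "ks'", "msl", "edom", "israel"},  # kingship + lands
    "T30-7": {"mwt", "hrb", "mlhmh", "qbr"},           # fighting + dying
}
dist_90 = {
    "T90-12": {"mlk", "ks'", "msl"},                   # kingship, refined
    "T90-31": {"hrb", "mlhmh"},                        # fighting only
    "T90-55": {"mwt", "qbr"},                          # dying only
}

def links(a, b, min_shared=2):
    """Edges between topics of two distributions, weighted by the number
    of shared words; links below min_shared are filtered out, as in the
    plotted hierarchy. Heavier edges come first (drawn darker)."""
    out = []
    for ta, wa in a.items():
        for tb, wb in b.items():
            shared = wa & wb
            if len(shared) >= min_shared:
                out.append((ta, tb, len(shared)))
    return sorted(out, key=lambda e: -e[2])
```

On this toy data the coarse fighting-and-dying topic T30-7 links to two refined topics in the 90-topic distribution, mirroring the splitting behaviour discussed below.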
<p>Looking at the noise we found in our dataset, further investigation of primary results will allow us to fine-tune the model and decrease the amount of noise. One of these points of inquiry is to find a balance between treating every occurrence of words (in all inflections, with conjunctions, etc.) as separate words, or dealing only with lexemes. The latter shows finer clusters at some points, for example clustering more fine-grained topics concerning kingship, containing more specific lexemes, already at a lower distribution. On the other hand, finding meaningful clusters is more difficult on the lexeme level, because words occurring in similar contexts but derived from the same lexeme can no longer be clustered together, which would provide topics easily recognisable for humans. Next to considering the relation between words and lexemes, we must also focus on removing other words with low semantic value, such as conjunctions, adverbs and the like, because they result in less meaningful clusters when taken up. This is more difficult for Hebrew than for English, because for example the conjunction waw is added to the beginning of the word, never appearing separately. We cannot simply break it loose from the word it is attached to, because in Hebrew this conjunction is also part of the pragmatic value of a sentence, making it hard to treat both verbs, with and without conjunction, as equal. A final point we have to investigate further is how words from coarser topics in low-level distributions are reclustered in higher-level distributions. For example, we can see that a low-level topic related semantically to both fighting and dying is in higher distributions split up into a topic dealing mainly with words revolving around fighting, and another with only words in the semantic field of dying.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Collecting Vocabulary for the Preparation of an Ontology</title>
<p>As stated in our introduction, we believe that our Topic Modeling approach can assist in developing new insights into the diachrony of Hebrew literature, which adds important information for a historical ontology of Hebrew literature. Early results show that typical words for a certain time period in Hebrew tend to be clustered together if they are low in semantic meaning, such as conjunctions. Investigations into what this can mean for shifts in concepts such as kingship are still ongoing, and it is as yet too early to draw even primary conclusions. Expected improvement is situated mainly in differences between topics with similar content extracted from each book of the Bible independently, in comparison to topics extracted from the entire Bible.</p>
<p>The value of our approach for this enquiry can be seen in, for example, the evolution of poetic writing, the different ways of speaking about enemies before and after the exile, and the reception of Israelite history in the books of Kings in comparison to the books of Chronicles. General scholarship accepts the view that the latter is a later rewriting of the former. This is a unique asset for studying the diachronic development of language use, which can be seen in changes in vocabulary and orthography, but also in the evolution of the semantic field used to speak about key issues such as kingship, victory and rules. However, to date no clear description of this intricate development has been produced using modern tools.</p>
        <p>In this article, we hope to have contributed to a new way of investigating Hebrew literature through the use of Topic Modeling. The basic methodology and architecture of our approach were described, and we pointed out how this approach can help to detect historical relations within the text. Future work will be situated on two fronts. On the one hand, we will use previous work done at JSTOR Labs linking works of Shakespeare to their scholarly literature, to allow a similar approach for the Hebrew Bible, and study how this knowledge can be used in conjunction with our topic hierarchy to allow the user more query options. On the other hand, we will try to improve topic extraction from the Bible, discussing the relationship between concrete words versus lexemes, the similarity between similar topics from different distributions, and several smaller levels of scope to perform Topic Modeling on, such as individual Bible books, chapters, or other meaningful units.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bod</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A New History of the Humanities. The Search for Principles and Patterns from Antiquity to the Present</article-title>
          . Oxford University Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Moretti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
:
          <source>Graphs, Maps, Trees. Verso</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Roorda</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Hebrew Bible as Data: Laboratory, Sharing, Experiences</article-title>
.
          <source>arXiv:1501.01866</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Eckart</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
:
          <article-title>A Standardized General Framework for Encoding and Exchange of Corpus Annotations: The Linguistic Annotation Framework, LAF</article-title>
          .
          <source>In: Proceedings of KONVENS, GAI</source>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>515</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Efron</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Organisciak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fenlon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Building Topic Models in a Federated Digital Library through Selective Document Exclusion</article-title>
          .
          <source>In: Proceedings of the American Society for Information Science and Technology</source>
          vol.
          <volume>48</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Brauer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedlund</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Historicizing Topic Models. A Distant Reading of Topic Modeling Texts within Historical Studies</article-title>
. In: International Conference on Cultural Research in the context of "Digital Humanities", pp.
          <fpage>152</fpage>
          -
          <lpage>163</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Mimno</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Computational Historiography: Data Mining in a Century of Classics Journals</article-title>
          .
          <source>In: Journal of Computing and Cultural Heritage</source>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8]
          <string-name>
            <surname>De Wilde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hengchen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Semantic Enrichment of a Multilingual Archive with Linked Open Data</article-title>
          . content/uploads/2015/04/06.pdf .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>