<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Studying the History of Pre-Modern Zoology with Linked Data and Vocabularies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Molka Tounsi</string-name>
          <email>tounsi.molka@etu.unice.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catherine Faron Zucker</string-name>
          <email>faron@unice.fr</email>
          <email>zucker@unice.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnaud Zucker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serena Villata</string-name>
          <email>serena.villata@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Cabrio</string-name>
          <email>elena.cabrio@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Inria Sophia Antipolis Mediterranee</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ.</institution>
          <addr-line>Nice Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>7</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>In this paper we first present the international multidisciplinary research network Zoomathia, which aims to study the transmission of zoological knowledge from Antiquity to the Middle Ages through varied resources, with a particular focus on textual information, including compilation literature such as encyclopaedias. We then present preliminary work in the context of Zoomathia consisting in (i) extracting pertinent knowledge from mediaeval texts using Natural Language Processing (NLP) methods, (ii) semantically enriching semi-structured zoological data and publishing it as an RDF dataset and its vocabulary, linked to other relevant Linked Data sources, and (iii) reasoning on this linked RDF data to help epistemologists, historians and philologists in their analysis of these ancient texts.</p>
      </abstract>
      <kwd-group>
        <kwd>History of Zoology</kwd>
        <kwd>Semantic Analysis of Mediaeval compilations</kwd>
        <kwd>Linked Data and Vocabularies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Scholars concerned with cultural issues in Antiquity or the Middle Ages have to
deal with a huge body of documentation. Literary material is a significant part of
it, but the technology commonly used to support this research is
to date far from satisfactory. In spite of pioneering digitization efforts
since the 1970s, historians and philologists still have access to few tools for working
on texts, and these are mostly limited to lexical searches. They therefore need
more intelligent tools in order to overcome this word-dependency, to access the
semantics of texts and to carry out more elaborate investigations.</p>
      <p>
        The Semantic Web has an increasing role to play in providing
new methodological tools for cultural studies. Over the last decade,
several works have addressed semantic annotation and search in Cultural
Heritage collections and Digital Library systems. They focus on producing Cultural
Heritage RDF datasets [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ], aligning these data and their vocabularies on the
Linked Data cloud [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ], and exploring and searching among heterogeneous
semantic data stores [
        <xref ref-type="bibr" rid="ref3 ref5 ref6 ref8">5, 8, 3, 6</xref>
        ].
      </p>
      <p>The international research network Zoomathia<sup>3</sup> has been set up to address
this challenge in the area of the History of Science. It aims to develop
interconnected research on the History of Zoology in pre-modern times and to foster
collaborative work involving philologists, historians, naturalists and researchers in
Knowledge Engineering and the Semantic Web. In this context, we conducted
preliminary work, presented in this paper, on the fourth book of the late
mediaeval encyclopaedia Hortus Sanitatis (15th century), which compiles ancient
texts on fishes. Each chapter of this book is dedicated to one fish, with possible
references to other fishes. In this work we aim at (i) automating the extraction of
information from these texts, such as zoonyms and zoological sub-disciplines
(ethology, anatomy, medicinal properties, etc.); (ii) building an RDF dataset and its
vocabulary representing the extracted knowledge, and linking them to the Linked
Data cloud; and finally (iii) reasoning on this linked data to produce new expert
knowledge. We build upon the results of two previous French research projects
on structuring mediaeval encyclopaedias in XML according to the TEI model
and manually annotating author sources (SourceEncyMe project<sup>4</sup>) and zoonyms
(Ichtya project<sup>5</sup>).</p>
      <p>The paper is organized as follows: Section 2 presents the general aim of
Zoomathia. Section 3 presents our work on knowledge extraction from the
mediaeval encyclopaedia Hortus Sanitatis, while Section 4 describes the publication
of a linked RDF dataset and its vocabularies. Section 4.3 presents preliminary
work on the exploitation of these data to support the study of the history of
pre-modern zoology, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>The Zoomathia Research Network</title>
      <p>Zoomathia primarily focuses on the transmission of zoological knowledge from
Antiquity to the Middle Ages. Manual search and computing on ancient and
mediaeval texts make it possible to address the quantitative dimension of the data but fail to
meet the epistemological demands, which concern the scientific relevance and
the diachronic features of the documentation. A large range of investigations on
specific topics is inaccessible through simple lexical queries and requires a rich,
scientific and semantic annotation. When investigating, for example,
ethological issues (such as animal breeding, intraspecific communication or technical
skills) or the pharmaceutical properties of animal products, we face
scattered documentation and a changing terminology, which hamper direct access
to and a synthetic grasp of the topics studied. An automated,
semantics-based process will help to link and cluster related data, to compare
evidence in a diachronic approach and to identify the major trends in the
cultural representations of animal life and behaviour.</p>
      <p><sup>3</sup> http://www.cepam.cnrs.fr/zoomathia/</p>
      <p><sup>4</sup> http://atelier-vincent-de-beauvais.irht.cnrs.fr/encyclopedisme-medieval/programme-sourcencyme-corpus-et-sources-des-encyclopedies-medievales</p>
      <p><sup>5</sup> http://www.unicaen.fr/recherche/mrsh/document_numerique/projets/ichtya</p>
      <p>In this network, we aim at both (i) identifying a corpus of zoology-related
historical data, in order to progressively encompass the whole known
documentation, and (ii) producing a common thesaurus operating on heterogeneous
resources (iconographic, archaeological and literary). This thesaurus should make it
possible to represent different kinds of knowledge: zoonyms; historical period;
geographical area; literary genre; economic context; zoological sub-discipline (ethology,
anatomy, physiology, psychology, animal breeding, etc.). The aim is to synthesize
the available cultural data on zoological matters and to crosscheck them from a
synchronic perspective. This would make it possible to address the crucial concern, i.e. to
precisely assess the transmission of zoological knowledge over the period and
the evolution of human-animal relations. Finally, this thesaurus should be
published as Linked Data and linked to modern reference sources (biological
and ecological) to appraise the relevance of the historical documentation.</p>
    </sec>
    <sec id="sec-4">
      <title>Knowledge Extraction from Historians and Texts</title>
      <sec id="sec-4-1">
        <title>Interviews of Historians</title>
        <p>We conducted several interviews with three historians participating in Zoomathia
to elicit a list of major knowledge elements that would be useful in the study
of the transmission of ancient zoological knowledge in mediaeval texts. Among
them are the presence (or absence) of zoonyms in the corpus texts; variant
names or name alternatives given to an animal (polyonymy); the relative volume
of textual records devoted to a given zoonym; references to a zoonym, and the
frequency of its occurrences, outside its dedicated chapter; the geographical
location of the described animals; numerical data in the text (size, longevity,
fertility, etc.); and other animal properties related to zoological sub-disciplines
(ethology, anatomy, physiology, psychology, animal breeding, etc.).</p>
      </sec>
      <sec id="sec-4-2">
        <title>Extraction of Zoonyms and Animal Properties from Texts</title>
        <p>We processed two versions of book 4 of Hortus Sanitatis: the original Latin
text and its translation into French. We used the XML structured version of these
texts, which identifies the 106 chapters of the book, divided into paragraphs, themselves
including citations. We used TreeTagger to parse the Latin and French texts and
determine the lemma and part of speech (PoS) of each word in the text. We
searched for available resources to support the knowledge extraction process.
A lexicon of fish names in French and in Latin was provided by the Ichtya
project, and we (Knowledge Engineers and Historians) collaboratively built a
thesaurus of zoological sub-disciplines and of the concepts involved in the descriptions
relating to these sub-disciplines. Then we defined two sets of syntactic rules for
French and Latin to recognize zoonyms from the lexicon of fish names among
the lemmas identified in the texts. For instance, one of the rules to recognize that
a Latin text deals with longevity is the occurrence of the verb vivere followed by
a numeric value followed by the noun annis (ablative plural of annus).</p>
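        <p>The longevity rule mentioned above can be illustrated with a minimal sketch. It assumes that the tagger output has already been loaded as (word, PoS, lemma) triples; the tag name NUM, the lemma value vivo and the sample sentence are assumptions made for illustration, not the project's actual rule set or data.</p>
        <preformat><![CDATA[
```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: str    # hypothetical tag set; "NUM" marks numerals in this sketch
    lemma: str  # hypothetical lemma values


def find_longevity(tokens: List[Token]) -> Optional[str]:
    """Return the numeral of a 'vivere + numeral + annis' sequence, if any."""
    for i in range(len(tokens) - 2):
        verb, number, noun = tokens[i:i + 3]
        if (verb.lemma == "vivo" and number.pos == "NUM"
                and noun.word.lower() == "annis"):
            return number.word
    return None


# Invented sample sentence, not a quotation from Hortus Sanitatis.
SAMPLE = [
    Token("piscis", "N", "piscis"),
    Token("vivit", "V", "vivo"),
    Token("sexaginta", "NUM", "sexaginta"),
    Token("annis", "N", "annus"),
]

print(find_longevity(SAMPLE))
```
]]></preformat>
        <p>Matching on lemmas rather than surface forms lets a single rule cover all inflected forms of the verb, which is why lemmatization precedes rule application.</p>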
        <p>We processed the same two texts in a similar way to extract zoological
sub-disciplines and animal properties. We defined two sets of syntactic rules to
extract this information from the Latin and French texts (39 rules for French and
10 rules for Latin). For instance, the Latin verbs curare (heal) or sanare (cure)
with an animal name as subject are used to identify the therapeutic topic; the
verbs comedere, pascere or deglutire (eat) are used to identify the diet topic.</p>
        <p>Evaluation. The analysis of the results of the automatic annotation process
was conducted by knowledge engineers and validated by philologists involved
in the manual annotation. For the evaluation of the extraction of zoonyms we
considered chapters 1 to 53 of book 4 of Hortus Sanitatis. We compared the
results of the automatic annotation with those of the manual annotation of
zoonyms conducted within the past Ichtya project. The F-measure is 0.93
for both the annotation of the Latin text and that of the French text. Most missing
annotations are due to the fact that the parsing tool is unable to deduce the exact
lemma of some words, especially Latin words. Among 65 missing annotations,
51 (rare) fish names were not annotated because TreeTagger does not recognize
them (e.g., loligo). Other missing annotations concern compound names and are
due to a mismatch between the complete fish name in the reference lexicon and
the short name used in the text to be annotated (e.g. locusta instead of locusta
marina). Conversely, most annotation errors are due to ambiguities between
the names of marine animals and those of terrestrial animals. For instance, the lemma lupus (wolf)
is present in the provided lexicon of fish names (wolffish) and the text contains some
comparisons with the (terrestrial) wolf<sup>6</sup>.</p>
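        <p>For readers unfamiliar with the measure, the following sketch shows how an F-measure can be computed by comparing a set of automatic annotations with a manual reference. The representation of an annotation as a (chapter, lemma) pair and the toy data are assumptions; they do not reproduce the evaluation data or results reported here.</p>
        <preformat><![CDATA[
```python
def f_measure(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over two annotation sets."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy annotations as (chapter, lemma) pairs; invented for illustration.
reference = {(1, "delphinus"), (1, "lupus"), (2, "locusta marina"), (3, "loligo")}
predicted = {(1, "delphinus"), (1, "lupus"), (2, "locusta marina"), (3, "canis")}

score = f_measure(predicted, reference)
print(round(score, 2))
```
]]></preformat>
        <p>In this toy case the missed annotation (loligo) lowers recall and the spurious one (canis) lowers precision, mirroring the two kinds of errors discussed above.</p>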
        <p>For the evaluation of the automatic extraction of animal properties, we
manually annotated the first 25 chapters of Hortus Sanitatis to use them as a reference
version. The F-measure is above 0.7 for both the annotation of the Latin text and that of the
French text. Most wrong annotations are related to anatomy. They
are due to a confusion between human and animal anatomical parts appearing
in the text, when the text deals with the therapeutic power of some animal on
a human organ. For instance, the detection of the lemma dentes (teeth) in the text
leads to the annotation of the text with the anatomy topic, whereas, in some
cases, the text describes a therapeutic effect of the animal on (human) teeth<sup>7</sup>.</p>
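        <p>The source of this confusion can be seen in a deliberately simplified sketch of lemma-based topic detection. Unlike the actual rules, it ignores syntax (for instance the requirement that an animal name be the subject), and the lemma forms and trigger lists are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
# Hypothetical trigger lemmas per topic; the real rule sets are richer.
TOPIC_LEMMAS = {
    "therapeutics": {"curo", "sano"},
    "diet": {"comedo", "pasco", "deglutio"},
    "anatomy": {"dens"},
}


def topics_for(lemmas):
    """Return every topic that has at least one trigger lemma in the passage."""
    present = set(lemmas)
    return {topic for topic, triggers in TOPIC_LEMMAS.items()
            if triggers & present}


# Invented lemma sequence for a passage about cleaning (human) teeth
# with shell ash: the lookup cannot tell whose teeth are meant.
passage = ["dens", "mundo", "cinis", "concha"]
print(topics_for(passage))
```
]]></preformat>
        <p>Because the lookup only sees the lemma, the passage is tagged with the anatomy topic even though it describes a remedy, which is the kind of error reported above.</p>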
      </sec>
    </sec>
    <sec id="sec-5">
      <title>From Unstructured Data to Semantic Data</title>
      <p>The extracted knowledge was first used to enrich the available XML
annotation of Hortus Sanitatis. Then we translated the whole XML annotation (text
structure, source authors, zoonyms and animal properties) into an RDF dataset
and vocabularies and exploited it with SPARQL queries.</p>
      <p><sup>6</sup> "And although this is the case for all fishes, it is however more obvious in him
(wolffish), as it is also for the wolf and the dog among the beasts"</p>
      <p><sup>7</sup> "[Human] teeth are cleaned using conch shell ash."</p>
      <sec id="sec-5-1">
        <title>RDF Dataset</title>
        <p>An RDF dataset describing Hortus Sanitatis has been automatically generated
by applying an XSL stylesheet to its XML annotation. Listing 1.1
presents an extract of it describing quotation 4 of paragraph 3 of chapter 20.
It is a citation of Aristotle, referring to the crocodile zoonym and addressing the
therapeutics and anatomy topics.</p>
        <p>Based on the lexicon initially provided by the historians involved in the Ichtya
project, we built a SKOS thesaurus for zoonyms and aligned it with both
the cross-domain DBpedia ontology and the Agrovoc thesaurus specialized for
Food and Agriculture<sup>8</sup>. In the near future we intend to align it with the TAXREF
taxonomy, which is specialized in Conservation Biology and integrates archaeozoological
data<sup>9</sup>. Listing 1.2 presents an extract of the thesaurus describing the taxon Garfish.</p>
        <preformat>&lt;http://zoomathia.org/Orphie&gt; a skos:Concept ;
    skos:prefLabel "orphie"@fr ;
    skos:closeMatch &lt;http://fr.dbpedia.org/resource/Orphie&gt; ;
    skos:closeMatch &lt;http://dbpedia.org/resource/Garfish&gt; ;
    skos:closeMatch &lt;http://aims.fao.org/aos/agrovoc/c_5102&gt; ;
    skos:altLabel "gwich" .</preformat>
        <p>Listing 1.2. Extract of the Zoomathia thesaurus of zoonyms</p>
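        <p>The translation from XML annotation to RDF described in this section was done with an XSL stylesheet. As a rough illustration of the same idea in Python, the sketch below turns one annotated citation into subject-predicate-object triples. The element and attribute names of the sample XML and the http://example.org/ namespace are placeholders, not the actual annotation schema or the project's URIs.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/hortus/"  # placeholder namespace (assumption)

# Hypothetical annotation markup for quotation 4, paragraph 3, chapter 20.
SAMPLE = """
<chapter n="20">
  <p n="3">
    <cit n="4" author="Aristotle" zoonym="crocodile"
         topics="therapeutics anatomy"/>
  </p>
</chapter>
"""


def citation_triples(xml_text):
    """Emit one (subject, predicate, object) triple per annotated feature."""
    chapter = ET.fromstring(xml_text)
    triples = []
    for para in chapter.iter("p"):
        for cit in para.iter("cit"):
            subject = (f"{BASE}ch{chapter.get('n')}"
                       f"/p{para.get('n')}/cit{cit.get('n')}")
            triples.append((subject, BASE + "author", cit.get("author")))
            triples.append((subject, BASE + "zoonym", cit.get("zoonym")))
            for topic in cit.get("topics", "").split():
                triples.append((subject, BASE + "topic", topic))
    return triples


for triple in citation_triples(SAMPLE):
    print(triple)
```
]]></preformat>
        <p>Encoding the chapter, paragraph and citation numbers in the subject URI keeps each annotation traceable to its location in the text.</p>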
        <p>We built an RDFS ontology of zoology-related sub-disciplines and animal
properties, based on the results of the interviews with historians and the properties
extracted from the texts. This is a preliminary model, which has to be further
developed.</p>
        <p><sup>8</sup> http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agriculturalthesaurus</p>
        <p><sup>9</sup> http://inpn.mnhn.fr/programme/referentiel-taxonomique-taxref?lg=en</p>
        <p>In order to exploit the extracted RDF knowledge base, we built a set of SPARQL
queries to answer questions such as "What are the zoonyms studied in
this text?", "What are the topics covered in this text?", "Where can we find
these topics?", "What are the zoonym properties (in which chapter or paragraph
or citation)?". Note that it is the semantics captured in the constructed
vocabularies that makes it possible to answer these queries: multiple labels
associated with a taxon in the thesaurus of zoonyms, and a hierarchy of zoology-related
sub-disciplines, denoted by various terms.</p>
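        <p>The role of multiple labels can be sketched without a triple store. The example below mimics, in plain Python, a query that retrieves the citations mentioning a taxon whatever label is used to name it. The two label triples come from Listing 1.2; the mentions predicate and the citation URI are hypothetical.</p>
        <preformat><![CDATA[
```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
ORPHIE = "http://zoomathia.org/Orphie"
MENTIONS = "http://example.org/mentions"  # hypothetical predicate

TRIPLES = [
    (ORPHIE, SKOS + "prefLabel", "orphie"),
    (ORPHIE, SKOS + "altLabel", "gwich"),
    ("http://example.org/cit1", MENTIONS, ORPHIE),  # hypothetical annotation
]


def citations_for_label(triples, label):
    """Find citations mentioning any concept that carries the given label."""
    label_properties = {SKOS + "prefLabel", SKOS + "altLabel"}
    concepts = {s for s, p, o in triples
                if p in label_properties and o == label}
    return sorted(s for s, p, o in triples
                  if p == MENTIONS and o in concepts)


print(citations_for_label(TRIPLES, "gwich"))
```
]]></preformat>
        <p>Both the preferred and the alternative label lead to the same concept and therefore to the same citations, which a purely lexical search on a single name would miss.</p>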
        <p>We went a step further in the exploitation of the RDF dataset by writing
SPARQL queries of the CONSTRUCT form to build new RDF graphs
capturing synthetic knowledge. When visualized graphically, these graphs support the
analytical reasoning of historians on the texts. For instance, Figure 1 presents the RDF
graph capturing the relative importance of zoology-related sub-disciplines in the
Hortus Sanitatis and their location in it. At a glance, it shows that anatomy
occupies a predominant place in this text, far ahead of therapeutics and fishing.</p>
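        <p>The kind of synthetic knowledge produced by such queries can be approximated by a simple aggregation. The sketch below computes, for each topic, its share of all topic annotations and the chapters where it occurs. It is a plain Python analogue, not the SPARQL queries used here, and the (chapter, topic) pairs are invented.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict


def topic_summary(annotations):
    """Map each topic to its share of annotations and its chapter list."""
    counts = Counter(topic for _, topic in annotations)
    chapters = defaultdict(set)
    for chapter, topic in annotations:
        chapters[topic].add(chapter)
    total = sum(counts.values())
    return {
        topic: {"share": count / total, "chapters": sorted(chapters[topic])}
        for topic, count in counts.items()
    }


# Invented (chapter, topic) annotations, for illustration only.
ANNOTATIONS = [
    (1, "anatomy"), (1, "therapeutics"), (2, "anatomy"), (3, "anatomy"),
]

for topic, info in topic_summary(ANNOTATIONS).items():
    print(topic, info)
```
]]></preformat>
        <p>Each entry of the result corresponds to one node of a summary graph, with the share giving its relative importance and the chapter list its location in the text.</p>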
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>We presented preliminary work conducted in the context of the Zoomathia
network on the mediaeval zoological encyclopaedia Hortus Sanitatis. This work
combines NLP techniques to extract knowledge from texts with knowledge
engineering and Semantic Web methods to build a linked RDF dataset of zoological
annotations of this scientific text. It exploits this dataset to support the analysis
of the ancient zoological knowledge compiled in the encyclopaedia.</p>
      <p>The next step will be to apply the presented process to a classical Latin
book on fishes (Pliny, Historia Naturalis, book 9, 1st century AD), which is a
major, though indirect, source of the Hortus Sanitatis. This will allow us to address the historical
perspective of zoology and to compare the data of the two selected
works, in order to appraise the density of the transmission and the evolution of
zoological knowledge from an epistemological point of view. We intend to systematically
compare the two texts, with the aim of evaluating the loss, distortion or
enrichment of information, and of comparing the relative importance in the books of the
different zoological perspectives (anatomical, ethological, geographical, etc.) and
of the different animal species.</p>
      <p>Acknowledgments. Zoomathia is an International Research Group (GDRI)
supported by the French National Scientific Research Center (CNRS).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>V.</given-names> <surname>de Boer</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wielemaker</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>van Gent</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hildebrand</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Isaac</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>van Ossenbruggen</surname></string-name>, and
          <string-name><given-names>G.</given-names> <surname>Schreiber</surname></string-name>.
          <article-title>Supporting Linked Data Production for Cultural Heritage Institutes: The Amsterdam Museum Case Study</article-title>.
          <source>In 9th Extended Semantic Web Conference, ESWC 2012</source>, Heraklion, Crete, Greece,
          <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. V. de Boer,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wielemaker</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. van Gent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oosterbroek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hildebrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Isaac</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. van Ossenbruggen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          . Amsterdam Museum Linked Open Data.
          <source>Semantic Web</source>
          ,
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>C.</given-names> <surname>Dijkshoorn</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Aroyo</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Schreiber</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wielemaker</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Jongma</surname></string-name>.
          <article-title>Using Linked Data to Diversify Search Results: a Case Study in Cultural Heritage</article-title>.
          <source>In 19th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2014</source>, Linköping, Sweden,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T.</given-names>
            <surname>Elliott</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gillies</surname>
          </string-name>
          .
          <article-title>Digital geography and classics</article-title>
          .
          <source>Digital Humanities Quarterly</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hildebrand</surname>
          </string-name>
          .
          <article-title>Interactive Exploration of Heterogeneous Cultural Heritage Collections</article-title>
          .
          <source>In 7th International Semantic Web Conference, ISWC</source>
          <year>2008</year>
          , Karlsruhe, Germany,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>L.</given-names> <surname>Isaksen</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Simon</surname></string-name>,
          <string-name><given-names>E. T. E.</given-names> <surname>Barker</surname></string-name>, and
          <string-name><given-names>P.</given-names> <surname>de Soto Cañamares</surname></string-name>.
          <article-title>Pelagios and the emerging graph of ancient world data</article-title>.
          <source>In ACM Web Science Conference, WebSci '14</source>, Bloomington, IN, USA, pages
          <fpage>197</fpage>–<lpage>201</lpage>. ACM,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>M.</given-names> <surname>Jackson</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Antonioletti</surname></string-name>,
          <string-name><given-names>A. C.</given-names> <surname>Hume</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Blanke</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Bodard</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hedges</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Rajbhandari</surname></string-name>.
          <article-title>Building bridges between islands of data - an investigation into distributed data management in the humanities</article-title>.
          <source>In Fifth International Conference on e-Science, e-Science 2009</source>, Oxford, UK, pages
          <fpage>33</fpage>–<lpage>39</lpage>. IEEE Computer Society,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><given-names>G.</given-names> <surname>Schreiber</surname></string-name>,
          <string-name><given-names>A. K.</given-names> <surname>Amin</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Aroyo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>van Assem</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>de Boer</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Hardman</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hildebrand</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Omelayenko</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>van Ossenbruggen</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Tordai</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wielemaker</surname></string-name>, and
          <string-name><given-names>B. J.</given-names> <surname>Wielinga</surname></string-name>.
          <article-title>Semantic Annotation and Search of Cultural-Heritage Collections: The MultimediaN E-Culture Demonstrator</article-title>.
          <source>J. Web Sem.</source>,
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <year>2008</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>