Semantic Web for BIBLIMOS (position paper) B ÉATRICE B OUCHOU M ARKHOFF1 , S OPHIE C ARATINI2 , F RANCESCO C OREALE2 , M OHAMED L AMINE D IAKIT É3 and A DEL G HAMNIA1 1 Université François Rabelais Tours - Laboratoire d’Informatique LI (EA 6300) beatrice.bouchou@univ-tours.fr, adel.ghamnia@univ-tours.fr 2 Université François Rabelais Tours - Laboratoire CITERES (UMR 7324)- EMAM team sophie.caratini@univ-tours.fr, francesco.coreale@univ-tours.fr 3 Université des Sciences, de Technologie et de Médecine - DMI, Nouakchott, Mauritanie diakite@ustm.mr Abstract. We present the BIBLIMOS project, which aims to address the Western Saharan culture and history, by considering both local ancient Arabic manuscripts and European colonial archives. We describe the project’s context and objectives before focusing on ancient Mauritanian manuscripts, the content of which covers many scientific fields. We assess the current state of such ancient manuscripts’ digital processing and we analyse what the semantic web can bring for their use by scholars, from North and South: the ability for applications to operate jointly on several distributed and heterogeneous sources of digitized manuscripts and other kinds of archives, to support collaborative reflection. Keywords: Ancient Arabic Manuscripts; Data Integration; Semantic Virtual In- frastructure; Western Saharan Cultural Heritage 1 Introduction BIBLIMOS is a long standing programme, led by the CITERES laboratory4 , that pro- poses to collect information, and facilitate the constitution of thematic corpora, from public and private archives pertaining to the history of the Western Saharan region. Its first goal was to provide to local students and researchers the ability to study their history, through a digital remote access to original materials (through images and de- scriptions), and also the ability to collaborate more easily with foreign teams, on these materials. Moreover, in the long run, it is also planned to deal with both primary sources (original material created at the time under study) and secondary sources (material writ- ten by scholars). In parallel, it is intended to address colonial archives about this geo- graphical area, from European countries (mainly France and Spain), in order to cross complementary points of views, and thus, to discover new knowledge. Involving an international and cross-disciplinary team of researchers in the human- ities and, more recently, in computer science, BIBLIMOS aims to renew the knowledge and analysis of Western Sahara’s societies, by making available to researchers from the North and the South an open and interactive tool for searching and comparing local 4 http://international.univ-tours.fr/ centre-for-cities-territories-environment-and-societies-citeres--283347. kjsp?RH=INTER 49 archive funds, including the manuscripts of the desert, and European archives related to these regions. There is also an important multilingual challenge, as we plan to perform cross-referencing of Arabic, Pulaar, Soninke, Wolof, French, Spanish, Portuguese, Ital- ian, Dutch, German, English sources relating to the political, military, economic, legal, social, scientific and religious history of the territories of the Western Saharan region, from the modern era to the end of the Cold War. Concerning computer science, the BIBLIMOS programme is just getting started: it aims to create an e-infrastructure based on a network of information around the history of the Western Sahara. This open tool will offer (i) an access to sets of archival sources and original manuscripts, (ii) a guide to navigate this knowledge network, (iii) an auto- matic registration of new sources and (iv) new tools for knowledge creation and visual- izations. It will also be interfaced with various useful existing applications for research, such as electronic publishing platforms, collaborative editing tools, bibliography man- agement tools, etc. To achieve this goal, three lines of work have been initiated. First, to instigate, assist and sustain the creation of quality digital resources from the origi- nal sources, second, to develop partnerships with providers of already existing digital resources, and third, to incrementally build the target distributed e-infrastructure, in- cluding a web portal as mediator, relying on semantic web resources and technologies. In the first line of work, BIBLIMOS stakeholders in Social Sciences and Human- ities (SSH) are engaged in actions aimed at discovering new local sources and con- vincing their owners to join the programme. Concerning the second line of work, today manuscript sources concealed in the Western Sahara are already partly inventoried, and many European archive funds are now available to the public. As shown in Table 1, on the one hand, online digitized full-text manuscripts exist, duly indexed and catalogued, and on the other hand, institutions or associations offer to collaborate in order to index digitized materials from many sources (cf. last lines in Table 1). Clearly the Web, that provides information exploitable by humans, well supports all those very useful initia- tives. However, the query, the analysis, the combination and the overlapping of these multiple funds, still represents a major challenge for every interested person. This paper is dedicated to the third line of work in the BIBLIMOS programme, which addresses the field of the automatic data-processing of such sources, in order to better assist humans in these tasks. This is a field in which almost everything has to be designed and built. The Semantic Web, i.e., the web knowledge exploitable automatically by computers, is the way to cope with these challenges, as we argue in Section 3, after having presented the state of the art of digital processing of Ancient Arabic Manuscripts in Section 2. 2 Digital processing of Mauritanian Ancient Arabic Manuscripts 2.1 Mauritanian Ancient Arabic Manuscripts We focus on Mauritania’s manuscripts because Sophie Caratini, the instigator of the BIBLIMOS programme, is an anthropologist specialist of Mauritania and she built strong collaborations with scholars in Nouakchott, in particular through the IMRS5 . 5 Institut Mauritanien de la Recherche Scientifique, see http://www.imrs.mr/spip.php? page=sommaire_fr 50 Site Description University of Illinois, Urbana-Champaign. http: Online catalogue, references about 22500 //www.westafricanmanuscripts.org/ manuscripts from eleven different collections, including Northwestern Univ. Northwestern University, Chicago. Online http://digital.library.northwestern. catalogue, entries from four separate edu/arbmss/index.html collections. Library of Congress. Online catalogue, with http://memory.loc.gov/intldl/ access to images of 32 manuscripts from malihtml/malihome.html Timbuktu, Mali. French National Library (BnF). Online http://gallica.bnf.fr/ access to 35 manuscripts from Timbuktu, Mali. University of Cape Town. Tombouctou http://www.tombouctoumanuscripts.org Manuscripts Project; access to primary sources upon registration. Universities of Freiburg and Tübingen (Germany). Online images of approx. 2.500 http://omar.ub.uni-freiburg.de/ Arabic manuscripts (134.000 images) from Mauritania, with bibliographical metadata. Bibliotheca Alexandrina (Egypt). Online collection of Arabic manuscripts related to http://wamcp.bibalex.org/ classical medicine, around 1000 books and fragments. Qatar Digital Library (with the British Library). Archives, maps, manuscripts, sound http://www.qdl.qa/en recordings, photographs with explanatory notes and links, in both English and Arabic. IMRS (Mauritanian Islamic Republic). makrim.org Catalog of Mauritanian manuscripts, in both French and Arabic. http://www.islamicmanuscript.org/ The Islamic Manuscript Association extresources/manuscriptcatalogues. (Cambridge, stakeholders from 25 countries). aspx List of Islamic manuscripts catalogues. Open Library (world wide open access http://openlibrary.org/ project). List of resources on Arabic manuscripts (catalogues, books, etc.). The Internet Archive (USA non profit http://www.archive.org/ association). A search on Arabic manuscripts gives some digitized books. Table 1: Web sites about Western Saharan, or more generally, Arabic manuscripts. 51 Mauritania is known [. . . ] for its enormously rich heritage of Arab manuscripts, many brought from the Arab East by pilgrims returning from Makkah, some recopied from those imported sources by students in the Qur’an schools [. . . ], and others composed by Mauritania’s own jurists, poets and historians6 [16]. According to researchers, some Mauritanian manuscripts were written as early as in the 10th century, and their forms and subjects are very diverse, including law, science and religion. To have access to this legacy, the first step is to build up a precise survey of all manuscript repositories in existence in the territories of the Western Saharan region. This has been the goal of long term projects: for instance, the West African Arabic Manuscripts Database Project, from the University of Illinois at Urbana-Champaign, started in 1987, provides a catalogue (first line of Table 1) that references more than two thousand manuscripts. Currently, it references eleven collections, which still is far from representing the actual reality of family libraries. This is one of the web resources we plan to exploit in the BIBLIMOS programme, in parallel of completing the repositories survey work performed by the SSH teams. Several other websites provide information on Western Saharan or, more generally, on Arabic manuscripts: the list presented in Table 1 shows that there is al- ready a lot of knowledge available on the web, but this knowledge still is exploitable only through human labour. 2.2 Digital Processing of Ancient Arabic Manuscripts Concerning manuscripts, many different descriptions may be stored in computer memo- ries: (i) seeing the manuscript as an archaeological object, i.e. starting from its external aspect, a set of features may be evaluated, for instance the material it is made with, the colour of ink, etc. This is called codicology [4] and a well-established vocabulary for such a set of descriptors is provided by the IRHT7 ; (ii) a numerical image of the manuscript can be taken; (iii) a transcription of the manuscript’s textual content can be created, either manually or automatically from its numerical image (with OCR tools); (iv) both the image and the transcription may be annotated, this is the case for many Eu- ropean manuscripts, whose textual contents are encoded using the TEI standard; (v) the manuscript can be catalogued, i.e. classified and described by librarians or archivists, so it could be found again among collections: this supposes to define and identify de- scriptors, including the location, and some general information about the content. For each of these descriptions, active research is conducted and, in some cases, they converge to well established standards. Specifically for ancient Arabic manuscripts, in [15] the authors present the problem of cataloguing, assessing the difficulties involved in identifying the metadata used by different schools (those dealing with specimen and those addressing whole volumes). The solution proposed for enhancing interoperability is to rely on the DCMI8 vocabulary. The TEI9 , aimed at helping libraries, publishers, museums and universities to encode texts in order to facilitate information retrieval from 6 http://www.saudiaramcoworld.com/issue/200306/mauritania.s.manuscripts.htm 7 Institut de Recherche et d’Histoire de Textes, see http://codicologia.irht.cnrs.fr 8 World widely used, simple and generic, digitized resources’ description, see http:// dublincore.org/ 9 Text Encoding Initiative: http://www.tei-c.org/index.xml 52 textual contents, is another important medium for interoperability [14]. Nevertheless we cannot hope to use it in the short term because for now the only way to get transcriptions of Mauritanian manuscripts is to manually enter the text. Indeed, automatic character recognition algorithms hardly apply to these kinds of manuscripts, written with Arabic graphemes but very often actually in many other languages (e.g. Pulaar, Wolof, etc.). In [1], the authors recall the existing difficulties for applying OCR to ancient Arabic manuscripts and, although recent advances are reported in [3] and [11], they need to be further developed. Manuscript image analysis is not reduced to OCR: for instance, word spotting may be a useful alternative to character recognition. This is why several works propose to build ontological descriptions (or sets of metadata) of graphical image features, in order to index and retrieve manuscripts’ digital images on this descriptive basis [7, 6]. But to the best of our knowledge, such proposals have never been applied to ancient Arabic manuscripts. When it comes to ontological representation of ancient manuscripts, the work de- scribed in [10], about the SAWS10 project (Sharing Ancient WisdomS), is clearly an example of what we target in the BIBLIMOS framework. It deals with collections of moral and social advice and/or philosophical ideas from Greek and Arab wisdom liter- atures. Many of the concerned manuscripts have been transcribed and annotated using TEI, and an extension of the FRBRoo ontology [9] has been developed to describe the transmission of information (from one copyist to another and from one language to another). The authors extract the relationships defined in the ontology from the TEI annotations, to generate a conceptual network expressed in RDF11 . This network al- lows researchers to explore links between the different documents’ contents. This is an example of how semantic web technologies contribute to the building of new means of knowledge, by opening up and linking various sources for research which would otherwise remain isolated and unused. 3 Semantic Web Architecture for BIBLIMOS For humans, carrying out some scientific work by using the resources listed in Table 1 is still difficult, as there are no means to perform cross-references, comparisons, or to analyse the different points of view they provide, etc. Regarding BIBLIMOS’ aims, other kinds of sources than manuscripts (e.g. European archives) should also be ex- ploited, which increases again these difficulties. Fortunately, while the web allowed sources’ owners (or depositaries) to publish their resources through websites, the se- mantic web now supports the development of softwares that help humans to cope with these difficulties. Indeed, the semantic web is a network of semantic representations of web-published information that relies on the same technical principles as the websites’ network, but allows programs to operate on data at this semantic level. Main semantic web concepts are (i) web ontologies and (ii) linked (open) data; they provide a global space of interoperability, thus they are important components for BIBLIMOS’ aims. Figure 1 illustrates the intended general architecture for the BIBLIMOS programme. The novelties brought by the semantic web obviously start at the DATA level: to benefit 10 http://www.ancientwisdoms.ac.uk/ 11 Data model standard: http://www.w3.org/RDF/ 53 from these novelties, beyond all the work that has to be done to obtain results presented in the previous section, digital sources should also be pushed up to the semantic level. To this aim, the sources’ concepts and their relationships must be specified, from the bottom-up (starting from the source contents), top-down (from already well defined consensual ontologies), or both. The source’s content should be related to this concep- tual level, which may be done by using tools called Mapping Frameworks in Figure 1. Some of those tools propose to export the source data into a set of RDF triples (the stan- dard data warehouse approach in data integration systems), and some of them propose to access data through the conceptual level, based on the ontology-based data access (OBDA) principles [5] (the mediation approach, which is provided by, e.g., ontop12 ). Whatever the chosen approach, the source’s content is then searchable at the semantic level, with SPARQL. Those contents may be combined using reference thesauri and ontologies. WEB PORTALS Visualisations Navigations Preferences Interactive Analysis Collaborations Virtual Res. Inf. QUERIES Information Retrieval Entity Linking Data Mining Inferences Integration Mediation Aggregation SPARQL queriable THESAURUS REFERENCE SOURCES ONTOLOGIES MAPPING CONCEPTS FRAMEWORKS describing SOURCES SOURCES annotations images databases Fig. 1: Global BIBLIMOS’ Virtual Infrastructure. Querying the semantic web through its linked data sets is still in its infancy. Pub- lic well-established reference knowledge resources play the important role of hubs in this linked data network. The most visible are resources of facts, e.g. DBpedia, but at the conceptual level, reference domain ontologies also act as fundamental integration means. This is the case for CIDOC CRM [13] for cultural heritage, with its extension 12 http://ontop.inf.unibz.it/ 54 FRBRoo for libraries. These reference domain ontologies are the product of a long, in- ternational collaborative work, reflecting a consensus among the domain experts. These distributed and collaborative dimensions of the web are naturally inherited by the se- mantic web. In the context of BIBLIMOS, this is extremely powerful because these two features mirror the local structural organization of the Mauritanian family libraries, open to communities but distributed in the country rather than centralized in only one authoritative place. The semantic web resources also promote multilingualism, particularly in vocab- ulary resources such as thesaurus, as evidenced by multilingual ones, e.g. VIAF13 or RAMEAU14 , the French national library thesaurus now accessible on the semantic web (in SKOS), which is fully interlinked with a German (SWD) and an American (LCSH) thesaurus (thanks to the Multilingual ACcess to Subjects project). Above the DATA layer is the LOGIC layer, in which all the well-known successful inventions in the field of data operation (some of which are listed in Figure 1) may be revisited to take into account the semantic dimension of data. A corner stone for most of them is to access multiple sources conjointly, which supposes interoperability: one of the solutions provided by the semantic web is to align the local lightweight ontolo- gies that describe the sources’ content to the reference ontologies, allowing mediator systems to aggregate local data sets, for instance following the principles described in [12, 2]. Very active researches are conducted in the semantic web community to de- velop this LOGIC level, based on efforts to produce a strong semantic data layer. Lastly comes the PRESENTATION layer, whose innovative potential is also greatly boosted by the possibilities issued from the semantic web. 4 Conclusion We first drew a state of the art concerning the ways ancient Arabic manuscripts are pro- cessed and made available to the public nowadays, considering that the picture is not so different in the area of European archives (except that OCR tools are more usable). Once digitized, sources must be pushed up to the semantic level, for the query, the analysis, the combination and the intersection of these multiple funds to be supported by auto- matic data-processing of sources. We presented the semantic-web Virtual Infrastructure designed to cope with these challenges within the BIBLIMOS programme. We are aware that BIBLIMOS is a very ambitious programme - we are not aware of the existence of a similar enterprise anywhere else - as semantic web applications in this field are just beginning to emerge. For now, agreements are signed between our universities (Tours and Nouakchott), both in the computer science side and the social science side.AFD15 currently funds a training campaign for librarians of the IMRS16 on cataloguing documents, and the Mauritanian government is going to support all the needed local actions. Concerning the semantic web level, we are building an ontology 13 Virtual International name Authority File: http://viaf.org/ 14 http://data.bnf.fr/en/semanticweb 15 Agence Française de Développement: http://www.afd.fr/lang/en/home 16 Institut Mauritanien de recherches scientifiques: http://www.imrs.mr/spip.php?page= sommaire_fr 55 for the IMRS’ manuscripts [8], a part of which is already digitized, and we plan to work on designing and building an annotation tool based on this ontology. In order to include the European side (archives on these countries), we are thinking about a MSC Action (deadline in january, 2016). The campaign of partnerships with already existing materials is still to be done, as we must first build the semantic web tools that we should propose to them. References 1. Abdel Belaı̈d and Nazih Ouwayed. Segmentation of ancient arabic documents. Guide to OCR for Arabic Scripts, pages 2–16, 2011. 2. Beatrice Bouchou and Cheikh Niang. Semantic mediator querying. In International Database Engineering and Applications Symposium (IDEAS), pages 29–38. ACM, 2014. 3. W. Boussellaa, A. Zahour, H. El Abed, A. Benabdelhafid, and A. Alimi. Unsupervised block covering analysis for text-line segmentation of arabic ancient handwritten document images. In 20th International Conference on Pattern Recognition (ICPR), pages 1929–1932, 2010. 4. Stefanie Brinkmann and Beate Wiesmüller, editors. From Codicology to Technology: Islamic Manuscripts and Their Place in Scholarship. Frank and Timme GmbH, 2009. 5. Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati, Marco Ruzzi, and Domenico Fabio Savo. The MASTRO system for ontology-based data access. Semantic Web, 2(1):43–53, 2011. 6. M. Coustaty, R. Pareti, N. Vincent, and J.M. Ogier. Towards historical document indexing: extraction of drop cap letters. IJDAR, 14(3):243–254, 2011. 7. B. Coüasnonet, J. Camillerapp, and I. Leplumey. Access by content to handwritten archive documents: Generic document recognition method and platform for annotations. IJDAR, 9(2):223–242, 2007. 8. Mohamed Lamine Diakité and Beatrice Bouchou Markhoff. OMOS: Ontology for Western Saharan Manuscripts. Technical Report 313, Université François Rabelais Tours, Laboratoire d’Informatique (available in HAL: https://hal.archives-ouvertes.fr/hal-01134010), 2015. 9. Martin Doerr and Patrick Le Boeuf. Modelling intellectual processes: The frbr - crm harmo- nization. In Digital Libraries: Research and Development, volume 4877 of Lecture Notes in Computer Science, pages 114–123. Springer, Berlin / Heidelberg, 2007. 10. A. Jordanous, K. F. Lawrence, M. Hedges, and C. Tupman. Exploring manuscripts: Sharing ancient wisdoms across the semantic web. In 2nd International Conference on Web Intelli- gence, Mining and Semantics (WIMS), pages 678–683. ACM, New York, 2012. 11. A. Khemiri, A. Kacem, and Belaid A. Towards arabic handwritten word recognition via probabilistic graphical models. In Frontiers in Handwriting Recognition (ICFHR), pages 678–683, 2014. 12. Cheikh Niang, Béatrice Bouchou, Yacine Sam, and Moussa Lo. A Semi-Automatic approach For Global-Schema Construction in Data Integration Systems. IJARAS, 4(2):35–53, 2013. 13. Dominic Oldman. The CIDOC Conceptual Reference Model (CIDOC-CRM): A Primer, Version 1. CIDOC CRM (http://www.cidoc-crm.org/docs/CRMPrimer v1.1.pdf), 2014. 14. Desmond Schmidt. Towards an interoperable digital scholarly edition. Journal of the Text Encoding Initiative [http://jtei.revues.org/979], 7, 2014. 15. M. O. Soulah and M. Hassoun. Which metadata for ancient arabic manuscripts catalogu- ing? In International Conference on Dublin Core and Metadata Applications, The Hague, Netherlands, 2011. 16. L. Werner. Mauritania’s manuscripts. Saudi Aramco World, 54(6):2–16, 2003. 56