Digital repertoires of poetry metrics: towards a Linked Open Data ecosystem

Mariana Curado Malta¹,², Elena González-Blanco¹, Clara Martínez¹, and Gimena del Rio³

¹ LINHD-UNED, Madrid, Spain
mariana.malta@linhd.uned.es, {egonzalezblanco,cimartinez}@flog.uned.es
http://linhd.uned.es
² CEOS.PP, Polytechnic of Oporto, Portugal
mariana@iscap.ipp.pt
http://www.iscap.ipp.pt
³ CONICET - IIBICRIT, Buenos Aires, Argentina
gdelrio.riande@gmail.com
http://www.conicet.gov.ar

Abstract. This paper presents work in progress of the POSTDATA project, which aims to provide the means to solve the interoperability issues that exist among digital poetry repertoires. These repertoires hold poetry metrics data that is locked in their own databases and is not freely available to be compared or to be used by intelligent machines that could infer over it. The POSTDATA project will use Linked Open Data (LOD) technologies to overcome these interoperability problems. POSTDATA is developing a metadata application profile (MAP) for the digital poetry repertoires, a construct that enhances interoperability. This development follows the method for the development of MAPs (Me4MAP). A MAP for the digital poetry repertoires will allow these repertoires to structure their data with a common model in order to publish it as Linked Open Data. This paper presents how this MAP has been developed so far.

Keywords: Digital humanities, Linked Open Data, interoperability, metadata application profile, poetry, digital repertoires

1 Introduction

This paper presents the metadata application profile (MAP) that is being created for digital poetry collections or repertoires. Poetry is a cultural product that uses language with a focus on its sounds and rhythms, trying to make every word count as something experienced meaningfully through the body at the same time as it is understood by the mind [1, p. 1]. Following [2, p.
132] meter is defined as "a systematic literary convention whereby certain aspects of the phonology are organized for aesthetic purposes". In this sense, versification is an abstraction of linguistic phenomena in which words (in their formal and semantic aspects) relate to rhythm and rhyme for artistic purposes. Although many theories about versification and metrics have been developed for the different languages and traditions, the POSTDATA project is interested in a structural and formal approach that breaks poetry down into discrete units, categories, and their relationships. That is why one of its main interests is the analysis of metrical repertoires in digital form.

A digital repertoire of poetry metrics is a catalogue that gives an account of the metrical and rhythmical schemes of a poetic tradition, a period or a school, gathering a large corpus of poems that are defined and classified by their main characteristics. This kind of repertoire may contain the text of the poem and information about authors, manuscripts, editions, music, and other features, all of them related to the poems. In the beginning, repertoires were printed books in which information was listed in a way similar to an address book. The digital era changed the way in which information is displayed, allowing the user to perform complex and multiple searches. In all these cases there is an ontological leap when the data is put into digital format⁴.

The lack of interoperability between the different digital repertoires dealing with poetry metrics across the different languages, literatures and traditions is a problem that needs to be addressed [4]. POSTDATA is a project financed by a Starting Grant of the European Research Council⁵ that aims to solve this problem.
The reason for this absence of interoperability is twofold: 1) there is a lack of standardization in the philological field due to the independent evolution of each cultural tradition; 2) the technological solutions used for building each poetic digital repertoire or database are very different, each tailored to its own model without taking into account, in most cases, the standards used in Digital Humanities.

The basis of POSTDATA is the building of a semantic system that will serve as a bridge across the gap between the technological and philological worlds. It aims to develop a metadata application profile that will give a semantic model for all the existing poetic digital repertoires that are currently available on the Web of Documents⁶. With this common model all these repertoires will be able to publish their data as Linked Open Data and become interoperable with one another.

The goal of this paper is to present how POSTDATA addresses the interoperability problem among the Digital Poetry Repertoires. This paper proceeds as follows: Section 2 briefly presents the POSTDATA project and the quest for interoperability of the digital repertoires of poetry metrics; Section 3 presents the metadata application profile (MAP) construct as a way to achieve interoperability, and a method to develop MAPs; Section 4 reports on the first steps of the development of the MAP for European poetry. The last section presents conclusions and future work.

⁴ For the concept of metrical repertoire and its history and evolution see [3].
⁵ ERC-2015-STG-679528
⁶ Web of Documents is a term used in contrast with the term Web of Data. The Web of Documents is made of documents read by human beings, who navigate between documents located on servers through hyperlinks; it is the Web that everyone uses on a daily basis. The Web of Data, Linked Data and the Semantic Web, three ways of expressing similar concepts, have technologies that enable people to create data stores on the Web, build vocabularies, and write rules for handling data [7].

2 POSTDATA

POSTDATA aims at narrowing the digital gap between poetry and technology by looking for interoperability solutions. This project has several dimensions, as shown in Fig. 1.

Fig. 1. POSTDATA project explanation schema.

It aims at building a digital research environment for creating poetry collections and repertoires, as well as a library of poetic treatises, where users can consume information or contribute to the corpora of the library by uploading their texts and analyses. Users will be able to use the Exploration & Discovery Visual Tools service to visualize syntactical structures, perform word-frequency analysis and find textual patterns in poems in order to reflect metrical and rhythmical varieties. This visualization will use automated methods for poetry analysis combined with other technologies such as Natural Language Processing or computational stylistics, together with TEI-XML encoding⁷.

⁷ See http://www.tei-c.org - retrieved October 11, 2016

POSTDATA will develop tools that apply Natural Language Processing algorithms at the first level of poem analysis, such as Named-Entity Recognition systems that extract information and classify elements of the text into predefined categories (names of persons, organizations, locations). Later revision of the results, such as corrections and additions of information, will also be possible. POSTDATA will also develop tools to perform statistical analysis of a poem, or of the corpora, to provide final users with relevant information.
These analyses will be fed by both the data of the local repertoire and the data available in the Digital Poetry LOD.

There is already a very relevant set of digital poetic repertoires on the Web of Documents, as well as a certain number of local databases. All these resources constitute a rich kaleidoscope of multilingual virtual poetry. Examples include repertoires in French: French lyrical collections (Nouveau Naetebus)⁸; in Italian: Bibliografia Elettronica dei Trovatori (BEdT)⁹; in Hungarian: the Répertoire de la poésie hongroise ancienne (RPHA)¹⁰; in Ancient Latin: the Corpus Rhythmorum Musicum¹¹; in Galician-Portuguese: the Cantigas de Santa María¹²; in Castilian: the Repertorio Métrico Digital de la Poesía Medieval Castellana (ReMetCa)¹³; in Dutch: the Dutch Song Database¹⁴; in Occitan: the Répertoire métrique de la poésie lyrique occitane des troubadours à leurs héritiers¹⁵; in Catalan: Repertori d'obres en vers¹⁶; in Skaldic: the Skaldic Project¹⁷; in German: the Lyrik des Minnesänger¹⁸; and in English: the Digital Edition of the Index of Middle English Verse¹⁹, among many others. It is not the aim of this paper to present all the repertoires, but only to show how alive the Digital Humanities Community of Poetry (DHCP) is, and how diverse and immense the DHCP information available on the Web of Documents is [5]. This data is at the moment locked in the information silos of each repertoire, and is not freely available to be compared or to be used by intelligent machines that could infer over it.

All these repertoires face a challenge of interoperability. POSTDATA addresses this issue by using LOD technologies [6]. It will add a semantic layer to its repertoire (the set of all poetry collections) in order to be able to publish poetry-related data as Linked Open Data, and to be interoperable with other entities that may do the same. POSTDATA will also provide a SPARQL endpoint for its dataset.
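To make the idea of publishing repertoire data as LOD more concrete, the following minimal sketch turns one flat database record into RDF statements in the N-Triples syntax. It is only an illustration under stated assumptions: the `http://example.org/` namespace, the record fields and the function names are hypothetical and are not part of POSTDATA; only the DCMI Metadata Terms namespace and the three properties used (`identifier`, `title`, `isPartOf`) are real.

```python
from urllib.parse import quote

# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"
# Real DCMI Metadata Terms namespace.
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record):
    """Serialize one flat repertoire record as a list of N-Triples lines."""
    subject = "<{}poem/{}>".format(BASE, quote(record["id"], safe=""))
    lines = [
        '{} <{}identifier> "{}" .'.format(subject, DCTERMS, escape_literal(record["id"])),
        '{} <{}title> "{}" .'.format(subject, DCTERMS, escape_literal(record["title"])),
    ]
    if "repertoire" in record:
        # Link the poem to its repertoire, which is itself a resource with a URI.
        repertoire = "<{}repertoire/{}>".format(BASE, quote(record["repertoire"], safe=""))
        lines.append("{} <{}isPartOf> {} .".format(subject, DCTERMS, repertoire))
    return lines


# Made-up record, not taken from any real repertoire.
example = {"id": "poem-001", "title": "Untitled song", "repertoire": "demo"}
for line in record_to_ntriples(example):
    print(line)
```

Once several repertoires expose statements of this kind with shared properties, a single query can span all of them, which is the kind of comparison that is not possible while the data stays in separate databases.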
⁸ http://nouveaunaetebus.elte.hu - retrieved September 27, 2016
⁹ http://www.bedt.it/BEdT_04_25/inf_home_crediti.aspx - retrieved September 27, 2016
¹⁰ http://rpha.elte.hu/ - retrieved September 27, 2016
¹¹ http://www.corimu.unisi.it - retrieved September 27, 2016
¹² http://csm.mml.ox.ac.uk/ - retrieved September 27, 2016
¹³ http://www.remetca.uned.es - retrieved September 27, 2016
¹⁴ http://www.liederenbank.nl/ - retrieved September 27, 2016
¹⁵ http://icalia.es/troubadours/ca/ - retrieved September 27, 2016
¹⁶ Local database
¹⁷ http://www.abdn.ac.uk/skaldic - retrieved September 27, 2016
¹⁸ http://www.lhm-online.de - retrieved September 27, 2016
¹⁹ http://dimev.net - retrieved September 27, 2016

POSTDATA will not achieve anything without the contribution of the DHCP. The repertoires of this community have data that is trapped in the Web of Documents and needs to be released, i.e., published as LOD. Making poetry available online as machine-readable data will open a world of possibilities for linking, indexing and extracting new information through the combination of the different datasets. For this data to be interoperable, POSTDATA needs to build a common model that structures all DHCP data in the same way. This common model is in fact a metadata application profile (MAP), a construct that enhances interoperability [8].

3 Development of Metadata Application Profiles

A profile is a document that shows how standards and specifications can be used to deploy a particular application. A metadata application profile is a construct that, when used by a certain community, enhances interoperability [9]. The Dublin Core Metadata Initiative (DCMI)²⁰, a well-known and influential global initiative concerned with metadata, defined the rules to build a MAP in a recommendation called "The Singapore Framework for Dublin Core Application Profiles" (see [9]).
This recommendation says that a MAP is composed of:
- functional requirements,
- domain model,
- description set profile,
- usage guidelines (optional),
- syntax guidelines (optional).

The functional requirements state what the community of practice wants to do with the data. The domain model presents a way to model the concepts (abstract and concrete) and the respective properties that the data represents. A MAP targets a community, meaning that all the different members of that community must feel represented in the domain model described by that MAP. This representativeness means that each member of the community must be able to describe its resources using the MAP defined by the community. If the MAP fails to serve a specific member of the community, this member will be excluded in the sense that its data will not be interoperable with the rest of the community of practice. If this exclusion happens, it might mean that the MAP was not well developed, because it does not respond to the needs of all the members of the community that took part in the development.

²⁰ http://dublincore.org - retrieved October 6, 2016

In LOD, a community of practice served by a MAP might have other communities of practice living at its boundaries. Both the boundary community and the community of practice might be interested in sharing part of the data, that is, they might want to have a certain level of interoperability between them. During the MAP development process, developers should be aware of these boundary communities and try to integrate, when possible, part of their characteristics. LOD is a wide and open ecosystem. The more boundary communities are touched, the more probable it is that the data will be used. The development of a MAP is thus a crucial task for a community of practice.
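The five components listed above, and the requirement that every member of the community be able to describe its resources with the MAP, can be sketched as a small data structure with a coverage check. This is a hedged illustration only: the class, the function and every entity or property name in it are hypothetical and are not taken from the Singapore Framework documents or from the MAP being developed by POSTDATA.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ApplicationProfile:
    """Hypothetical container for the five Singapore Framework components."""
    functional_requirements: List[str]
    domain_model: Dict[str, Set[str]]        # entity name -> its property names
    description_set_profile: Dict[str, str]  # "Entity.property" -> vocabulary term
    usage_guidelines: Optional[str] = None   # optional component
    syntax_guidelines: Optional[str] = None  # optional component


def uncovered_fields(
    profile: ApplicationProfile, member_fields: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return the (entity, property) pairs of a member that the domain model lacks."""
    missing = []
    for entity, prop in member_fields:
        if prop not in profile.domain_model.get(entity, set()):
            missing.append((entity, prop))
    return missing


# Illustrative profile and member data; all names are made up.
profile = ApplicationProfile(
    functional_requirements=["search poems by metrical scheme"],
    domain_model={"Poem": {"title", "metricalScheme"}},
    description_set_profile={"Poem.title": "dcterms:title"},
)
member = [("Poem", "title"), ("Poem", "melody")]
print(uncovered_fields(profile, member))
```

A non-empty result signals fields that a member could not express with the profile, which is the situation the text describes as exclusion from the community's interoperable data.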
Such a development should be structured and should integrate, from its earliest phases, input from all representative members of the community of practice. The DHCP organizations differ in organization type, location, culture and in the language they speak. Finding a common ground of understanding in such an environment is a huge challenge. This circumstance is not new for a MAP development. In fact, such a development is often done in complex settings that are very open, in contrast with the development of software that serves a single organization protected inside its walls of context, culture and language, where requirements can be elicited using well-known techniques. In a MAP development, developers will never know the total reach of the MAP: the community of practice that the MAP serves can be very well defined, but there will always be a degree of uncertainty, and eliciting requirements under such uncertainty is not easy.

The authors think that the existence of a method for the development of a MAP may help to address all the challenges mentioned above. Recent studies say that there is no method for the development of MAPs (see [13]). In order to address this issue, Curado Malta and Baptista [11,12] have been working on the definition of a method for the development of metadata application profiles (Me4MAP). POSTDATA is using Me4MAP²¹ to develop the MAP for European Poetry (MAP-EP).

²¹ Only draft versions have been published so far. The first version of Me4MAP was submitted to an international research journal and is waiting for approval. The POSTDATA team is using this first, not yet published, version of Me4MAP.

4 MAP-EP Development process

The development of the MAP-EP faces the challenge of serving at least 14 repertoires that are presently active on the Web of Documents²². There are other initiatives that are also part of the community of practice, but not of its core. The poetic repertoires can be defined as the core community of DHCP; other initiatives such as the LOD project of the Biblioteca Nacional de España²³, the data project of the Museo del Prado²⁴, Pelagios²⁵, Biblissima²⁶ and Claros²⁷, among others, are the boundary communities described in Section 3. These projects do not deal with poetry but with information concerning bibliographic records, arts in general, and geographical places and persons connected to the resources described (manuscripts, pieces of art, objects in general). POSTDATA also wants to have a certain degree of interoperability with these initiatives. The Vision Statement of the MAP-EP should clearly state what the core domain is and should also open doors to other boundary domains. The POSTDATA Vision Statement is still being defined.

²² This number is changing at the time of writing, since the project is a work in progress.
²³ See http://www.datos.bne.es - retrieved October 7, 2016
²⁴ https://www.museodelprado.es/modelo-semantico-digital/el-prado-en-la-web - retrieved October 11, 2016
²⁵ http://commons.pelagios.org/ - retrieved October 8, 2016
²⁶ http://www.biblissima-condorcet.fr - retrieved October 8, 2016

As defined in Me4MAP, the first activity is the first Singapore Stage (S1), which develops the Functional Requirements. According to Me4MAP, the functional requirements can be elicited by developing use cases. The POSTDATA use case model is built from the study of: (i) the functionalities of the digital repertoires that are on the Web of Documents; (ii) the local repertoires that are being built by researchers while the project is being developed, and that want to use POSTDATA tools to be able to share and use data. So far there are two such local repertoires working with POSTDATA. POSTDATA will also run a survey of the end users of the repertoires in order to understand what such users would like to do with the data.
This survey will run online. All POSTDATA partners (those responsible for the repertoires) will help POSTDATA to disseminate the survey. The answers will be analysed and a set of functionalities defined. From all this work the POSTDATA team will define a use case model that makes the Functional Requirements explicit.

²⁷ http://www.clarosnet.org/ - retrieved October 8, 2016

The POSTDATA team is also already collecting information about the data models of the databases which, together with the functional requirements, will be used to define the Domain Model in the second Singapore Stage (S2), the second activity defined by Me4MAP. This information is being collected, organized and analysed. The POSTDATA team contacted all those responsible for the repertoires in order to obtain documentation of the databases. Communicating with some of them is not easy, since many are not database experts and so do not speak the same language. This results in information that is not understandable or that is not enough to derive a data model. Much of the information is reconstructed with the help of the philologists of the team, who analyse the websites and their functionalities in order to understand the meaning of some fields. Of the 21 repertoires, we have so far collected information from 15 (see Table 1).

Table 1. Type of information sent by those responsible for the repertoires

Type                    How many   Observations
MySQL dump script       5          Could be opened in phpMyAdmin; the logical model could be analysed.
MWD file                2          Could be opened with MySQL Workbench; the logical model could be analysed.
XML data files          2          Could be loaded into an XML parser; the XML Schema could be extracted.
XML DTD file            1          The XML Schema could be extracted using an XML parser.
Perl script with data   1          Could be opened with a plain text editor and analysed.
Excel file with data    1          Could be opened with OpenOffice; the tables in the file could be analysed.
Documentation           3          PDF files with text explaining the tables and fields, some with ER diagrams of the database. The PDF files could be analysed, but inconsistencies could not be checked.

When defining the Domain Model, it will be very important to be aware of standard conceptual models that exist in the same community of practice. The POSTDATA team intends to study FRBRoo²⁸ with the aim of integrating it in the domain model, since it has become a very important conceptual model in the Galleries, Libraries, Archives and Museums (GLAM) community. FRBRoo is in fact an object-oriented formulation of the FRBR model²⁹ as an extension of CIDOC CRM³⁰. TEI, the Text Encoding Initiative³¹, which has a module for the description of poetry-related resources, is a data model that should also be taken into account. This data model is not yet deployed in the Semantic Web, although it is widely used by the DHCP (using XML-related technologies).

²⁸ See http://archive.ifla.org/VII/s13/wgfrbr/FRBRoo_V9.1_PR.pdf - retrieved October 10, 2016

Me4MAP defines another activity, to be developed in parallel with S2, called Environmental Scan. According to Me4MAP, an Environmental Scan is a report that contains a review of the metadata schemas that are available in any serialization of the Semantic Web (e.g. RDF/XML, Turtle, etc.) and that may serve the needs of the Domain Model. The POSTDATA team is aware of the importance of using standard and/or the most widely used RDF vocabularies to achieve good levels of interoperability with other communities of practice. The study of these vocabularies is done in the Environmental Scan. The development of the Environmental Scan of MAP-EP has already started but is still at a very early stage.
Nevertheless, the POSTDATA team has the following considerations:
- the most widely used standards should be preferred, so dcterms³² will always be a first choice for terms and classes;
- Digital Manuscripts to Europeana (DM2E)³³ is a very important initiative that will be used to describe concepts related to manuscripts;
- the BIBFRAME vocabulary³⁴ and the BIBO³⁵ ontology are also strong candidates to describe the bibliographic records of the POSTDATA domain model.

Since the domain of MAP-EP has names of persons and locations related to the bibliographic records, authority sources such as the GeoNames ontology³⁶, DBpedia³⁷ and the VIAF directory³⁸ are planned to be used.

²⁹ See http://www.ifla.org/publications/functional-requirements-for-bibliographic-records - accessed October 10, 2016
³⁰ See http://www.cidoc-crm.org - accessed October 10, 2016
³¹ See http://www.tei.org - accessed October 12, 2016
³² See http://dublincore.org/documents/dcmi-terms/ - retrieved October 10, 2016
³³ See http://dm2e.eu/ - retrieved October 8, 2016
³⁴ See https://www.loc.gov/bibframe/ - retrieved October 17, 2016
³⁵ See http://bibliontology.com/ - retrieved October 8, 2016
³⁶ See http://www.geonames.org/ontology/ - retrieved October 10, 2016
³⁷ http://dbpedia.org - retrieved October 12, 2016
³⁸ http://viaf.org - retrieved October 12, 2016

5 Conclusions and Future Work

This paper presents preliminary work of a research project financed by the European Research Council (ERC). This project (POSTDATA) wants to solve the interoperability problems that exist among the digital poetry repertoires. These repertoires are present on the Web of Documents or are local files holding data on poetry metrics, that is, information about poetry analysis. This data is trapped in each database, structured in many different ways, and is not shared among repositories. The aim of POSTDATA is to liberate this data so that it can be shared and open, in order to be used by intelligent machines that can compare the data and infer over it, arriving at new dimensions of knowledge.

The proposed solution to this interoperability issue is to use Linked Open Data technologies and to publish the data as LOD. The data needs to be structured with a common model, that is, a metadata application profile (MAP), a construct that enhances interoperability. POSTDATA is using Me4MAP, a method for the development of application profiles, to develop a MAP for European Poetry (MAP-EP). This paper presents the way MAP-EP is being developed. To define the functional requirements and the domain model of the MAP, the POSTDATA team is using: 1) the websites and the logical models of the repertoires; 2) use cases from the work of researchers who are collecting poetry data and discussing with the POSTDATA team what they want to do with the data; and 3) a survey of the final users of the existing repertoires, asking what they would like to do with the data. At the same time, the POSTDATA team is already developing the Environmental Scan, a report that lists all the RDF vocabularies that may serve the domain model. At the end of the project all repertoires will be able to map their relational models to the MAP-EP and will be ready to publish data as LOD.

As future work, the POSTDATA team has to follow the path defined by Me4MAP to develop MAP-EP. According to the plan, a first version of MAP-EP will be ready by the end of 2017.

During this development process a research team will monitor the use of Me4MAP in order to validate it. Me4MAP was developed using a Design Science Research (DSR) methodology (see [15]). The evaluation of the use of Me4MAP will feed back into the construction phase of the DSR process in order to create a revised version of Me4MAP.

Acknowledgments.
The authors would like to thank all those in charge of the repertoires for their help in sharing information and discussing database issues with the POSTDATA team. Mariana Curado Malta thanks ISCAP.IPP for her three-year leave, which opened the possibility to work in POSTDATA, a wonderful and challenging professional experience in Madrid. This paper has been developed thanks to the research projects funded by MINECO and led by Elena González-Blanco: Acción Europa Investiga EUIN2013-50630: Repertorio Digital de Poesía Europea (DIREPO) and FFI2014-57961-R. Laboratorio de Innovación en Humanidades Digitales: Edición Digital, Datos Enlazados y Entorno Virtual de Investigación para el trabajo en humanidades, and the Starting Grant research project: Poetry Standardization and Linked Open Data: POSTDATA (ERC-2015-STG-679528), funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (http://postdata.linhd.es/).

References

1. Attridge, D.: Poetic rhythm: an introduction. Cambridge University Press (1995)
2. Chatman, S.: Comparing Metrical Styles, pp. 132-155. Essays in the Language of Literature. Eds. Chatman, S. and Levin, S. R. Boston (1967)
3. González-Blanco García, E., Martínez Cantón, C. I., Martos Pérez, M. D., and Del Río Riande, M. G. D.: Una propuesta de integración del sistema de formularios de bases de datos MYSQL con etiquetado TEI: ReMetCa, Repertorio digital de la métrica medieval castellana, pp. 209-219. Humanidades Digitales: desafíos, logros y perspectivas de futuro, eds. López Poza, S. and Pena Sueiro, N. Annex to Janus, http://hdl.handle.net/2183/13587 - accessed December 8, 2016 (2014)
4. González-Blanco, E., Del Rio Riande, G., Martinez Cantón, C.: DH Poetry Modelling: a Quest for Philological and Technical Standardization, pp. 526-528. Proceedings of the Conference in Digital Humanities 2016 - Conference Abstracts. http://dh2016.adho.org/abstracts/73 - accessed December 8, 2016 (2016)
5. González-Blanco García, E. and Seláf, L.: Megarep: A comprehensive research tool in medieval and renaissance poetic and metrical repertoires, pp. 321-332. Humanitats a la xarxa: món medieval / Humanities on the web: the medieval world. Eds. Soriano, L., Coderch, M., Rovira, H., Sabaté, G. and Espluga, X. Oxford, Bern, Berlin, Bruxelles, Frankfurt am Main, New York, Wien: Peter Lang (2014)
6. González-Blanco García, E., Del Río Riande, G., and Martínez Cantón, C. I.: Linked open data to represent multilingual poetry collections. A proposal to solve interoperability issues between poetic repertoires. Proceedings of the 5th Workshop on Linked Data in Linguistics: Managing, Building and Using Linked Language Resources. Eds. McCrae, J. P., Chiarcos, C., Montiel Ponsoda, E., Declerck, T., Osenova, P. and Hellmann, S. http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-LDL2016_Proceedings.pdf - accessed December 8, 2016 (2016)
7. Semantic Web, https://www.w3.org/standards/semanticweb/ - accessed December 8, 2016
8. Interoperability Levels for Dublin Core Metadata, http://dublincore.org/documents/interoperability-levels/ - accessed December 8, 2016
9. The Singapore Framework for Dublin Core Application Profiles, http://dublincore.org/documents/singapore-framework/ - accessed December 8, 2016
10. Baker, T., Dekkers, M., Heery, R., Patel, M., Salokhe, G.: What terms does your metadata use? Application profiles as machine-understandable narratives. Journal of Digital Information 2, 2 (2001)
11. Curado Malta, M. and Baptista, A. A.: Me4DCAP V0.1: A method for the development of Dublin Core Application Profiles. In Proceedings of the 17th International Conference on Electronic Publishing - Mining the Digital Information Networks, pp. 33-44. IOS Press (2013)
12. Curado Malta, M. and Baptista, A. A.: A Method for the Development of Dublin Core Application Profiles (Me4DCAP V0.2): Detailed Description. In International Conference on Dublin Core and Metadata Applications, pp. 90-103. Dublin Core (2013)
13. Curado Malta, M. and Baptista, A. A.: State of the Art on Methodologies for the Development of a Metadata Application Profile. In Proceedings of MTSR 2012, CCIS 343, pp. 61-73. Springer-Verlag, Berlin Heidelberg (2012)
14. IFLA: Functional Requirements for Bibliographic Records. International Federation of Library Associations and Institutions (2009)
15. Hevner, A.: The three cycle view of design science research. Scandinavian Journal of Information Systems, vol. 19 (2), p. 87 (2007)