<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">DAICA - Digital Assistant Investigating Cultural Assets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lothar</forename><surname>Hotz</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">HITeC e.V</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dan</forename><surname>Cristea</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">University Alexandru Ioan Cuza and Romanian Academy</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Justyna</forename><surname>Pietrzak</surname></persName>
							<email>justyna@eleka.net</email>
							<affiliation key="aff3">
								<orgName type="institution">Eleka Ingeniaritza Linguistikoa S.L</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Povazay</surname></persName>
							<email>martin.povazay@psolutions.at</email>
							<affiliation key="aff4">
								<orgName type="department">P.Solutions Informationstechnologien GmbH</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brigitte</forename><surname>Rauter</surname></persName>
							<email>brigitte.rauter@psolutions.at</email>
							<affiliation key="aff4">
								<orgName type="department">P.Solutions Informationstechnologien GmbH</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Daniela</forename><surname>Buleandra</surname></persName>
							<email>daniela.buleandra@siveco.ro</email>
							<affiliation key="aff5">
								<orgName type="institution">SIVECO Romania SA</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">University of Hamburg</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">DAICA - Digital Assistant Investigating Cultural Assets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9A9185A91CD6DB7AC89F8CD4860964FF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T15:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Semantic search</term>
					<term>machine translation</term>
					<term>summarization</term>
					<term>optical character recognition</term>
					<term>cultural heritage</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Besides web pages, the web offers access to an immense variety of digitized source material, inventories and catalogues hosted by libraries and archives relevant for humanities and social sciences (HSS) studies. In practice, remote access to HSS information is considerably hampered by several barriers: researchers interested in a specific topic do not know which institution holds information related to it; data collections come with their own user interfaces and offer different data structures; language barriers impede the exploitation of information; and retrieval mechanisms do not provide intelligent access to semantically related information. In this paper, we describe a Digital Assistant Investigating Cultural Assets (DAICA) for research and information procurement in HSS, guided by the vision of a digital information space of cultures. DAICA will support HSS studies by autonomously identifying appropriate resources and presenting topical investigation results. In particular, DAICA will integrate technologies and provide a solution for analysing historical digitized documents, performing semantic search in deep data structures, translating automatically, extending a search by meaningful relations, creating summaries of identified resources, and providing user interactions for complex search results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The web offers an immense variety of digital or digitized source material, inventories and catalogues relevant for studies in humanities and social sciences (HSS). Institutions such as libraries, museums, archives, local and regional authorities, parliamentary and media documentation services provide a wealth of</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">A Use Case of DAICA</head><p>Figure <ref type="figure" target="#fig_0">1</ref> presents a use case of the digital assistant described in this paper, with user and background interactions processed by DAICA. Its main features are the proactive discovery of the user's topics of interest ("A1") and background actions based on localization if the user is moving ("A2") and on OCR for interpreting original ancient documents. By further observing the user's writing activities, DAICA computes related sources on the web and, hence, supports the user in his or her investigation activities ("A3").</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">A New Integrated Search Procedure</head><p>A main task of DAICA is the transparent integration of semantic search, machine translation, and all other capabilities such as summarisation, OCR, and entity discovery (see Figure <ref type="figure" target="#fig_1">2</ref>, which illustrates the complete process). A user states the search specification (query), implicitly or explicitly, in his or her own language, the source language (German in the example). DAICA translates this query and its ontological enhancements into the target language of a data source (here Romanian) and starts the search. The results and their enhancements through links and summarisation are then translated back into the source language. In the following section, the details about the involved components are given. In order to obtain sustainable, reusable results, the objective of DAICA is to build a general framework which will be used for the development of complex specialised architectures, each accommodating different capabilities selected during configuration sessions. These capabilities realize the basic technologies for investigation tasks, i.e., optical character recognition (OCR) for converting text images into words, search which takes the meaning of queries and documents into account, machine translation for interpreting documents in foreign languages, summarisation for getting a quick overview of a document or article, and entity and link discovery for identifying important persons and subjects in a text. However, it is not evident which capabilities to use at which time, or, if capabilities come in variants, which version is best for a given investigation task.</p><p>Therefore, DAICA is defined as a framework, its infrastructure, and a suitable interface technology which can be used to interactively assemble architectures by selecting suitable components which implement the capabilities for a given investigation task. 
Various kinds of users, "aggregators", will use this kit to build DAICA configurations that best support their own needs or meet the investigation requirements of others.</p><p>DAICA instantiations will constitute another layer of outputs resulting from DAICA interactions. A DAICA instantiation (or instance) represents the combination of a DAICA configuration and a specific set of resources acquired from different data sources (usually referring to a specific topic a certain user has worked on). These resources will be accumulated by a user (or a community of users) during a series of work sessions with DAICA. All interactions, typically with specific investigation goals and directed to specific resources, are stored, catalogued and offered for post-research re-use, for the benefit of their creators or future users. Hence, DAICA instantiations are collections of resources with respect to a topic.</p><p>Examples of DAICA configurations can be:</p><p>-DAICA-1: Capability to process contemporary German, Spanish, English and Romanian, with OCR, indexing, external linking of named entities, summarisation, and translation between these four languages;</p><p>-DAICA-2: Processing German and Romanian texts from 1850 to the present, with OCR covering Gothic German script and the transitional Cyrillic alphabet used in Romania in the middle of the 19th century, indexing and external linking of named entities and time expressions, summarisation, and translation between these two languages.</p><p>Examples of DAICA instantiations can be:</p><p>-Based on DAICA-2: Links to the bibliographical sources in the Library of Hamburg and the Academy Library of Bucharest, and a knowledge base with dated entries related to migration in Germany and Romania in the 19th century;</p><p>-Based on DAICA-1: Links of German academic libraries with information from Basque archives, in relation to investigations carried out by German scholars in the Basque Country in the 19th century.</p><p>Hence, an 
instantiation summarizes all information about a specific topic, e.g., content identified in some libraries and notes made by the user. Furthermore, if made public, such previously created instantiations can be used and refined by other users through the DAICA instantiation retrieval mechanism. As such, the DAICA instantiations are the basis for building a community discussing and further developing cultural topics.</p><p>DAICA uses a number of existing technologies, which will be adapted to comply with the actual requirements. The DAICA capabilities include the following features:</p><p>-Specification and customization of the investigation task (proactive and triggered search specifications);</p><p>-Specification of data sources, their access, and their content in the form of metadata schemas, languages, or ontologies and used terminology, hence enabling access to foreign data and content without the need of manually traversing user interfaces or interpreting a library structure (by data source profiles);</p><p>-Analysis of ancient digitized documents (by pattern recognition and OCR, word spotting);</p><p>-Deep semantic search through data sources (by indexing and semantic search);</p><p>-Automatic translation of queries and resources (by machine translation);</p><p>-Linking of resources with expressive relationships on the basis of semantic entities (by entity identification, reference resolution, detection of temporal and spatial relations);</p><p>-Creating summaries of the identified resources (by automatic summarisation);</p><p>-Friendly end-user interfaces for the visualization of complex search results and dependencies in the Web for different types of devices: laptop, tablet, smart phone (by innovative visualization and user-device interaction);</p><p>-Easy configuration of new processing architectures to support a wide range of thematic investigations (by configuration facilities);</p><p>-Project profiling for storing, retrieving and sharing of resources as DAICA instances (by instantiation 
facilities).</p><p>In summary, these features will be integrated in a generally applicable and customizable technological framework that will allow easy configuration of new architectures in order to help researchers and other categories of users to perform assisted cultural HSS investigations. Once installed in a DAICA platform, the framework can be used by aggregators (libraries, research institutes, administration) to configure new applications that will allow public users to get access to new data or to administer previously curated DAICA instantiations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Technologies for DAICA</head></div>
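<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of how these capabilities could be chained in the integrated search procedure of Section 3 (translate the query, search the data source, summarise, translate back), consider the following Python fragment. All names in it (Configuration, investigate, the toy dictionaries and the toy documents) are assumptions made for illustration and are not part of DAICA; the word-by-word lookup and the identity summariser merely stand in for real machine translation and summarisation components.</p>
<p><code lang="python">
```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy dictionaries standing in for a machine translation capability.
DE_TO_RO: Dict[str, str] = {"auswanderung": "emigrare"}
RO_TO_DE: Dict[str, str] = {"emigrare": "auswanderung", "raport": "bericht"}


def translate(text: str, table: Dict[str, str]) -> str:
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(table.get(word, word) for word in text.lower().split())


@dataclass
class Configuration:
    """A set of capabilities selected for one investigation task."""
    to_target: Callable[[str], str]
    to_source: Callable[[str], str]
    search: Callable[[str], List[str]]
    summarise: Callable[[str], str]


def investigate(query: str, config: Configuration) -> List[str]:
    """Translate the query, search, summarise each hit and translate it back."""
    target_query = config.to_target(query)
    hits = config.search(target_query)
    return [config.to_source(config.summarise(hit)) for hit in hits]


# Toy Romanian data source and keyword search (assumptions for the example).
DOCUMENTS = ["raport emigrare", "raport agricol"]


def keyword_search(query: str) -> List[str]:
    """Return documents sharing at least one word with the query."""
    terms = set(query.split())
    return [doc for doc in DOCUMENTS if terms.intersection(doc.split())]


config = Configuration(
    to_target=lambda text: translate(text, DE_TO_RO),
    to_source=lambda text: translate(text, RO_TO_DE),
    search=keyword_search,
    summarise=lambda doc: doc,  # identity placeholder for a summariser
)

results = investigate("Auswanderung", config)
print(results)
```
</code></p>
<p>Passing the capabilities as plain callables mirrors the idea that a configuration selects interchangeable components; a different translation or summarisation variant could be swapped in without changing the procedure itself.</p></div>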
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Existing investigation tools</head><p>The widely used Aleph integrated library system provides academic, research, and national libraries with tools and workflow support for their current and future requirements. Built on an Oracle database, Aleph runs on a range of operating systems. Employing system-wide XML technology, Aleph offers third-party integration through an XML gateway. The product is based on industry standards and offers resource sharing, connectivity, and interaction with other systems and databases. Another solution used in libraries is DigiTool, which enables academic libraries and library consortia to manage and provide access to digital resources, both those that are created for use within the institution and those that are collected and maintained by the library for the benefit of the public.</p><p>Since many resources are publicly exposed on the Web, other existing investigation tools or techniques which can be used for searching are Web search engines and crawlers. Some open source or commercial tools (which can influence the solution) are: Datapark, ebhath, Eureka, Indri, ISearch, IXE, Lucene, Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Nutch, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra, BBDBot and Zettair.</p><p>Besides search itself, one technique to be used when combining HSS data from multiple sources is data integration. State-of-the-art approaches for data integration have adopted a schema-first (e.g., ETL, enterprise integration), a schema-never (e.g., search engines), or a schema-later (e.g., dataspaces) methodology.</p><p>Such tools provide the basic search and data access interfaces to library content. 
For DAICA, libraries operating those tools can and will be integrated through data source profiles. Furthermore, the search facilities provided by the tools will be used by the semantic search capability to perform keyword-based search.</p><p>Many cultural assets are currently published through EUROPEANA. EUROPEANA bases its search functionalities on who, what, where, and when, with corresponding restrictions for media type, language, country, and provider. DAICA will base the search on semantic ontologies and multilingual access, thus facilitating document access for users. However, through the envisioned data source profiles, EUROPEANA can be integrated in the DAICA framework and, thus, be part of a DAICA investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">OCR, pattern recognition</head><p>For identification and retrieval of digitized but not yet recognized documents, DAICA includes OCR (Optical Character Recognition) tools. The main challenges which have to be faced are the following:</p><p>-OCR must be based on a variety of historical fonts and spellings.</p><p>-Document images may have poor quality and may require image enhancement.</p><p>-Character and word recognition may be ambiguous due to noise.</p><p>-Word, sentence and semantic context must be exploited for disambiguation.</p><p>There exist several commercial and open source OCR tools which perform high-quality OCR (up to 99%) for standard fonts and low-noise conditions <ref type="bibr" target="#b0">[1]</ref>. On the other hand, character and word recognition results may be quite poor (below 80%) without prior knowledge of the font and without exploiting context information.</p><p>Hence, DAICA applies several innovative techniques to achieve high-quality OCR. First, OCR tasks are supported by their semantic context using metadata and ontologies, so that ambiguities can be significantly reduced. For example, ambiguous readings can be rejected if the semantic distance (computed from an ontology such as WordNet) to the investigation topic exceeds a threshold. As a second innovative technique, applicable to manuscripts or unusual fonts, DAICA will allow word spotting based on patterns supplied by the investigator. This way, occurrences of similar patterns can be retrieved from a document. 
A third technique, mainly applicable to handwritten documents, will be the use of an advanced text-line finder which can cope with varying line orientations.</p><p>Thus, the approach for DAICA will be mainly based on existing OCR tools of the partners and open source tools, as well as low-level and context-supported computer vision and manuscript analysis <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>.</p></div>
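<div xmlns="http://www.tei-c.org/ns/1.0"><p>The first technique of Section 5.2, rejecting ambiguous OCR readings whose semantic distance to the investigation topic exceeds a threshold, could be sketched as follows. This is only an illustration under stated assumptions: the miniature taxonomy is hypothetical and stands in for an ontology such as WordNet, distance is taken to be the number of edges on the shortest path between two terms, and the function names are invented for this example.</p>
<p><code lang="python">
```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical miniature taxonomy (term -> broader terms); a real system
# might derive such links from an ontology such as WordNet.
TAXONOMY: Dict[str, List[str]] = {
    "ship": ["vessel"],
    "vessel": ["transport"],
    "voyage": ["transport"],
    "sheep": ["animal"],
    "animal": ["entity"],
    "transport": ["entity"],
}


def neighbours(term: str) -> List[str]:
    """Broader and narrower terms of a term (undirected view of the taxonomy)."""
    broader = TAXONOMY.get(term, [])
    narrower = [child for child, parents in TAXONOMY.items() if term in parents]
    return broader + narrower


def semantic_distance(a: str, b: str) -> Optional[int]:
    """Edges on the shortest path between two terms, or None if unconnected."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def plausible_readings(candidates: List[str], topic: str, threshold: int) -> List[str]:
    """Keep readings whose distance to the topic does not exceed the threshold."""
    kept = []
    for word in candidates:
        dist = semantic_distance(word, topic)
        if dist is not None and threshold >= dist:
            kept.append(word)
    return kept


# Two competing readings of a noisy word image, judged against the topic "voyage".
readings = plausible_readings(["ship", "sheep"], "voyage", threshold=3)
print(readings)
```
</code></p>
<p>In this toy setting the reading "ship" lies three edges from the topic and is kept, while "sheep" lies four edges away and is rejected; the threshold itself would have to be tuned for a real ontology.</p></div>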
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Semantic search</head><p>A central goal of DAICA is to provide support for studies of cultural heritage by extending keyword-based search to a much broader search based on semantic relations. A semantic search has the advantage of narrowing down ambiguous word meanings, especially across language barriers, and allowing proactive background search for related information. This goal can be achieved by a variety of techniques which try to take the intention of the user and the meaning of a query into account when searching in data sources.</p><p>There exist several approaches for semantic search, as documented, for example, in the surveys <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. DAICA focuses on exploiting ontological information, which is used in two fundamental ways: (i) to define, refine and expand the query topic; (ii) to find semantically related information in data sources.</p><p>Several publicly accessible implementations of semantic search approaches exist, including QWant, GoPubMed, Swoogle, and Google's Knowledge Graph, which deal with specific kinds of ontological representations. These techniques, however, do not meet the requirements for the intelligent agent conceived in this work: (i) DAICA will have to access a large number of heterogeneous content structures used in the archiving institutions for cultural heritage or similar data aggregations. Some may be supported by full-fledged Semantic Web ontologies, others by customized categorization schemes. In consequence, it will be necessary to invoke ontology alignment in some form; (ii) Search will be multilingual, crossing language barriers between the user and information sources; (iii) In DAICA, the user can define a query by several kinds of topic descriptions, ranging from keywords, annotated images, and graphic patterns to coherent texts. 
Hence, several heterogeneous measures of semantic distance will play a part, for example taxonomical distance, relatedness by names, time or geographical location, or chains of ontological structures; (iv) The user will be supported by proactive search, i.e., by autonomous background explorations through entity and link discovery in the user's writing; (v) Access to DAICA will be possible via mobile devices, and the rendition of results will include summarisation.</p><p>The software for the individual techniques is mostly available, either as open source or held by the authors. The main task for DAICA is to conceive and integrate a tool combining the techniques in a user-friendly way.</p></div>
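<div xmlns="http://www.tei-c.org/ns/1.0"><p>The first use of ontological information named in Section 5.3, expanding the query topic, might look roughly like the sketch below. The ontology fragment, the relation names "synonyms" and "narrower", and both functions are assumptions for illustration only; a real data source would be reached through its own search facilities rather than an in-memory list.</p>
<p><code lang="python">
```python
from typing import Dict, List, Set

# Hypothetical ontology fragment: synonyms and narrower concepts per term.
ONTOLOGY: Dict[str, Dict[str, List[str]]] = {
    "migration": {"synonyms": ["emigration"], "narrower": ["resettlement"]},
    "resettlement": {"synonyms": [], "narrower": []},
}


def expand_query(terms: List[str], depth: int = 1) -> Set[str]:
    """Add synonyms and narrower concepts, following narrower links `depth` levels."""
    expanded = set(terms)
    frontier = list(terms)
    for _ in range(depth):
        next_frontier = []
        for term in frontier:
            entry = ONTOLOGY.get(term, {})
            expanded.update(entry.get("synonyms", []))
            for narrower in entry.get("narrower", []):
                if narrower not in expanded:
                    expanded.add(narrower)
                    next_frontier.append(narrower)
        frontier = next_frontier
    return expanded


def search(documents: List[str], terms: Set[str]) -> List[str]:
    """Return documents containing at least one of the terms (case-insensitive)."""
    return [doc for doc in documents if terms.intersection(doc.lower().split())]


docs = ["Emigration from Hamburg", "Harvest records", "Resettlement lists"]
hits = search(docs, expand_query(["migration"]))
print(hits)
```
</code></p>
<p>A plain keyword search for "migration" would find none of the three toy documents, whereas the expanded query reaches two of them; this is the kind of broadening that semantic relations are meant to provide.</p></div>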
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Ontology management</head><p>In our approach, ontologies play an important role for obtaining meaningful search results in support of a user's investigation. All essential DAICA functionalities resort to ontologies, in particular semantic search, language translation, interpreted OCR, entity discovery, topical linking, and summarisation. Ontologies may provide concept names and definitions in terms of relations to other concepts, for example generalization, specialization, synonyms and antonyms. Standardized properties relate entities to important search criteria, such as location and time.</p><p>Due to the highly heterogeneous data sources of the cultural heritage and diverse evolved standards, investigations with DAICA have to cope with multiple ontologies in different languages, ranging from carefully designed OWL ontologies to simple databases characterized by metadata schemes. In order to determine the relevance of resources for a user query, DAICA must be able to align these ontologies with the semantics of the query. Several methods for query answering based on multiple and multilingual ontologies have been developed in the past decade, see <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref> for surveys. Typically, there is a matching (or alignment) step where correspondences between heterogeneous ontologies are determined, and an interpretation step, where information relevant for a query is extracted.</p><p>In the DAICA infrastructure, ontology matching and interpretation will be performed for ontologies based on standards such as Dublin Core or Schema.org, on controlled vocabularies (WordNet and thesaurus vocabularies), and on existing biographical data standards and classifications. 
In several countries, data sources are described by authority files of standardized metadata; in Germany, these include the Gemeinsame Normdatei, beacon files, gazetteer data and links, files for common corporate body data (GKD, Gemeinsame Körperschaftsdatei) with company and institution names, registers of personal names (PND), and the subject headings authority file (SWD, Schlagwortnormdatei) with commonly used tag words, categories, and subject headings. Mass data with named entity tagging and recognition data will enhance the scope of results and open up semantic relations and links to more resources.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">DAICA instantiation retrieval</head><p>A DAICA instantiation represents a DAICA configuration and the resources acquired with this configuration by a user, or a community of users with close scientific interests, for a specific topic. As such, a DAICA instantiation represents a complete investigation case which is both a useful documentation for the investigator and a valuable resource for similar investigations of other users. It is the objective of DAICA to support all users of the DAICA community by a case base of instantiations and case-based retrieval mechanisms.</p><p>Case-based information retrieval is a well-established technology, see <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> for surveys. While case-based retrieval was originally conceived for feature-based object representations, applications to relational structures have proved quite successful <ref type="bibr" target="#b12">[13]</ref>. More recently, case-based retrieval was further enhanced by ontology-based representations and corresponding similarity measures <ref type="bibr" target="#b13">[14]</ref>. During the development of DAICA, special theoretical attention will be given to an ontological organisation of the collections of DAICA instantiations. For example, issues of interest here are demarcation strategies (when do two instances have to be considered identical or distinct?) and inheritance (does an instantiation A inherit parts of descriptions, sources, links, etc. from an instance B?).</p></div>
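<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the idea of case-based retrieval over instantiations in Section 5.5 more concrete, the sketch below stores each instantiation as a feature-based case and ranks cases by Jaccard similarity of their entity sets. The choice of Jaccard similarity is an assumption made here for simplicity and is not the ontology-based measure the text refers to; the class, the function names and the two stored cases are likewise hypothetical.</p>
<p><code lang="python">
```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Instantiation:
    """A stored case: a configuration name plus the entities its resources refer to."""
    name: str
    configuration: str
    entities: FrozenSet[str]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of common entities among all entities of both sets."""
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)


def retrieve(query: FrozenSet[str], case_base: List[Instantiation],
             k: int = 1) -> List[Tuple[str, float]]:
    """Return names and scores of the k most similar stored instantiations."""
    scored = [(case.name, jaccard(query, case.entities)) for case in case_base]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Two illustrative cases loosely modelled on the example instantiations above.
case_base = [
    Instantiation("migration-19c", "DAICA-2",
                  frozenset({"Hamburg", "Bucharest", "migration"})),
    Instantiation("basque-studies", "DAICA-1",
                  frozenset({"Basque Country", "German scholars"})),
]

best = retrieve(frozenset({"Hamburg", "migration"}), case_base)
print(best)
```
</code></p>
<p>A set-based measure ignores relations between entities, so questions such as demarcation and inheritance between instantiations would need a richer, ontology-aware representation than this sketch provides.</p></div>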
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Machine translation, multilingual processing in combination with semantic search and summarisation</head><p>Developing efficient machine translation is a long-lasting and multi-level process. DAICA uses a mixture of the mature technologies of statistical (SMT), example-based, and rule-based (RBMT) machine translation. As basic features, DAICA includes:</p><p>-Resource collection (semi-automatic parallel corpora extraction and dictionary building, with special emphasis on lesser-resourced languages and in-domain registers). These data will be used for training and tuning machine translation modules.</p><p>-Development of the query translation module. Previous experience and expertise of the partners will be used for adapting existing methods to the language pairs of DAICA. SMT is language-independent, and the same toolkit can be used for any pair of languages, provided that monolingual texts and parallel texts exist for all translation pairs. The state of development of rule-based machine translation (RBMT), however, varies depending on the language pair.</p><p>-DAICA will use the Apertium software <ref type="bibr" target="#b14">[15]</ref>. Apertium is a classical shallow-transfer or transformer system, released under a GNU licence. Apertium includes dictionaries for language pairs involving Spanish. The Apertium MT engine consists of pipelined modules for morphological analysis, part-of-speech tagging, and text generation. Statistical machine translation (SMT) will be based on the Moses toolkit <ref type="bibr" target="#b15">[16]</ref>.</p><p>Hence, our summariser is multilingual at the architectural level, meaning that it incorporates a pipeline of modules which has the same structure irrespective of the language of the processed document. 
However, the initial elements of this chain (among which are the tokeniser, the POS-tagger, the lemmatiser, the NP-chunker, the clause splitter, the named entity recogniser, and the anaphora resolver) are strongly language dependent.</p><p>In the former ATLAS project<ref type="foot" target="#foot_0">6</ref>, summarisers for Bulgarian, German, Greek, Polish and Romanian have been built, meaning that our general summarisation architecture has been adapted for all these languages by assembling basic-level NLP modules supplied by partners. For DAICA, we will build summarisers for German, English, and Romanian, by re-using (and, where necessary, also enhancing) the German and Romanian basic components and including open-source modules for English.</p><p>The situation is somewhat different when the language of the documents is old. Romanian, for instance, has changed dramatically over time. Not only have the lexicon, grammar and syntax evolved, but the alphabet has also changed from Old Cyrillic to Latin, with a mixture of the two, called the Cyrillic Transition Alphabet, used for a period in the middle of the 19th century. Based on previous work <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>, we will study diachronic Romanian morphology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7">Entity and link discovery</head><p>Recognition of entity mentions in texts (names of people, moments of time, countries, locations, events, organizations) and their correct interpretation in context is an issue of primary importance in DAICA. These mentions should open access gates to entries in the collection of accessible resources. Examples of points of interest are:</p><p>-Identify entity mentions in metadata field values and full texts and, if necessary, interpret them ontologically, e.g., identify temporal entities and historical dates and events;</p><p>-Identify relevant relations between entities, such as relations between instances: &lt;person&gt; is-in &lt;place&gt; (at &lt;time&gt;), &lt;country&gt; invades &lt;country&gt;, &lt;person&gt; signs &lt;treaty&gt;, etc.;</p><p>-Identify collections of documents having contingent content, such as &lt;document&gt; is-primary-source-for &lt;event&gt;, &lt;document&gt; in-relation-with &lt;event&gt;, &lt;document&gt; mentions &lt;person&gt; (at &lt;time&gt;).</p><p>One approach for entity discovery is the use of large repositories of entity names (gazetteers), such as person names, topics, locations, or temporal mentions. Ontologies and terminological databases can equally be used. This approach might look like brute force; however, because of the existence of authority files and terminologies in library research, there is a huge number of such entity stores and ontologies which can be used, similar to those mentioned in Section 5.4. Furthermore, we have recently built a large collection of regular expressions for the identification of geographical locations in free texts. In addition, larger contexts and syntactic analysis can be used to identify relations between entities. Once such relations are detected, the documents containing them can be tagged and indexed accordingly, providing information that can be used for intelligent retrieval.</p></div>
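<div xmlns="http://www.tei-c.org/ns/1.0"><p>The gazetteer-based approach of Section 5.7, combined with a regular expression for one relation of the form person is-in place (at time), can be illustrated with the following sketch. The gazetteer entries (including the invented person name), the single pattern and the function names are assumptions for this example; they are not the authors' collection of regular expressions, and real texts would need far more robust patterns or syntactic analysis.</p>
<p><code lang="python">
```python
import re
from typing import Dict, List, Tuple

# Hypothetical miniature gazetteer mapping surface forms to entity types.
GAZETTEER: Dict[str, str] = {
    "Hamburg": "place",
    "Bucharest": "place",
    "Maria Popescu": "person",  # invented name used only for illustration
}

# Illustrative pattern for "PERSON lived in PLACE in YEAR":
# group 1 is the person, group 2 the place, group 3 the year.
IS_IN = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) lived in ([A-Z]\w+) in (\d{4})")


def find_entities(text: str) -> List[Tuple[str, str]]:
    """Return (mention, type) pairs for gazetteer entries found in the text."""
    found = []
    for name, kind in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, kind))
    return found


def find_is_in(text: str) -> List[Tuple[str, str, str, str]]:
    """Return (person, 'is-in', place, year) tuples confirmed by the gazetteer."""
    relations = []
    for person, place, year in IS_IN.findall(text):
        if GAZETTEER.get(person) == "person" and GAZETTEER.get(place) == "place":
            relations.append((person, "is-in", place, year))
    return relations


text = "Maria Popescu lived in Hamburg in 1852 before returning to Bucharest."
entities = find_entities(text)
relations = find_is_in(text)
print(entities)
print(relations)
```
</code></p>
<p>Checking the matched arguments against the gazetteer types keeps the pattern from producing relations between arbitrary capitalised words; the resulting tuples could then be used to tag and index the document.</p></div>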
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8">Summarisation</head><p>Nowadays the quantity and diversity of data on the internet on any subject is vast, which makes it more and more difficult to enquire into specific subjects. The needed information is usually hidden in an ocean of irrelevant data. This is one aspect of the well-known problem of information overload. One way to deal with it is to use summarisation techniques. In <ref type="bibr" target="#b18">[19]</ref>, a summary of a text is defined as a piece of text that conveys the important information of the original and that is no longer than half of the original length, usually significantly less. Summarisation is a hard problem in Natural Language Processing because, in order to do it properly, one has to really understand the point of a text. This requires semantic analysis, linking of mentioned entities (usually referred to as anaphora resolution), discourse processing, and inferential interpretation.</p><p>Text summarisation methods can be classified into extractive and abstractive. An extractive summary includes sequences of words taken from the original document, which could be clauses, sentences or paragraphs. An abstractive summary does not reproduce sequences from the original document, but rather includes paraphrases of sections that mention important facts, events, and entities.</p><p>Many systems are known which perform automatic text summarisation, applying different techniques. 
Some use surface methods (involving no linguistic analysis but exploiting instead the format of the document); some take named entities from the original text as pivot elements and assume that the text surrounding them is important and should stay in the summary (involving some kind of lexical analysis and classification methods); some relate significant features in the text and the summary and try to learn the ability to produce summaries from human-produced ones (involving learning and statistical methods); and some are based on discovering the discourse structure (involving processing at the linguistic, syntactic and discourse levels). Summarisation systems can also be classified by the number of processed texts, as single-document or multi-document; by the languages processed, as monolingual or multilingual; and by the genre of the processed texts.</p><p>The summarisation approach in DAICA is an extractive single-document process producing general or focussed summaries. It will enhance the approach described in <ref type="bibr" target="#b19">[20]</ref>, which is currently considered one of the leading approaches in state-of-the-art automatic summarisation. It involves a long processing chain (including a tokeniser, a part-of-speech tagger, a noun phrase chunking module, a named entity recogniser, an anaphora resolution module, a clause splitter and a discourse parser). 
The improvements that we plan to realise in DAICA to the multilingual summarisation model, initially built in the ATLAS project, concern several components: (i) the anaphora resolution engine, by adding rules that allow coreference resolution on finer criteria; (ii) the clause splitter module, by implementing new machine learning algorithms and integrating them into the calibration system; (iii) the discourse parser, by integrating recent enhancements of Veins Theory <ref type="bibr" target="#b20">[21]</ref>, aimed at reducing the search space in an incremental parsing process <ref type="bibr" target="#b21">[22]</ref>; and (iv) the use of the recently proposed metrics for comparing tree structures <ref type="bibr">[Mitocariu et al., 2013]</ref>.</p></div>
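<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the notion of an extractive single-document summary concrete, the following minimal Python sketch selects whole sentences from a text until a length budget of half the original is reached, in line with the definition of a summary given above. It is not the DAICA processing chain and does not use discourse structure: it substitutes a simple surface method (word-frequency scoring) purely for illustration. The names STOP_WORDS, split_sentences, tokenize and summarise, the tiny stop-word list, the naive sentence splitter and the max_ratio parameter are all assumptions introduced for this example.</p>

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or",
              "is", "are", "to", "it", "that", "this"}


def split_sentences(text):
    """Naive splitter: a sentence is a run of characters up to '.', '!' or '?'."""
    parts = [p.strip() for p in re.findall(r"[^.!?]+[.!?]?", text)]
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-cased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]


def summarise(text, max_ratio=0.5):
    """Extractive summary: pick high-scoring sentences within a length budget.

    A sentence is scored by the average document frequency of its content
    words. Sentences are taken greedily in score order and emitted in their
    original order; the result never exceeds max_ratio of the input length.
    """
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    budget = int(len(text) * max_ratio)
    chosen, used = [], 0
    for i in ranked:
        # One extra character accounts for the joining space.
        cost = len(sentences[i]) + (1 if chosen else 0)
        if used + cost > budget:
            continue  # sentence does not fit; try a shorter one
        chosen.append(i)
        used += cost
    return " ".join(sentences[i] for i in sorted(chosen))
```

<p>Calling summarise(text) on a short text returns the highest-scoring sentences that fit within half of the original character length, in their original order. A discourse-centred approach such as the one envisaged for DAICA would replace the frequency score with evidence from anaphora resolution and discourse parsing, while the selection of original text spans under a length constraint would remain the same.</p></div>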
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Summary</head><p>In this paper, we presented a concept for a digital assistant that integrates and combines semantic technologies such as interpreted OCR, semantic search, summarisation, entity and link discovery, machine translation, and case-based retrieval to support users in investigation and research tasks. The assistant focuses mainly on cultural heritage data sources such as libraries. However, the underlying technologies allow the DAICA concept to be applied to arbitrary Internet sources such as web pages or social media data. This paper represents a preliminary step of refining the conceptual and design principles before starting the actual development of a new technology. Nevertheless, we have already applied the basic technologies that a complete DAICA system will use in similar approaches. We believe that present-day technologies from the domain of Artificial Intelligence that have attained theoretical and practical maturity can be combined in DAICA in a very creative way.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Use Case "New Type of Assisted HSS Study"</figDesc><graphic coords="3,134.77,263.22,362.70,351.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Process of a search session with ontological enhancement, DAICA instantiation (see next section) retrieval, and machine translation</figDesc><graphic coords="4,135.98,280.03,343.39,348.77" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_0">www.atlasproject.eu</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Optical Character Recognition Techniques: A Survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Emerging Trends in Computing and Information Sciences</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="545" to="550" />
			<date type="published" when="2013-06">June 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Evaluation of Retrieval Performance in Historical Newspaper Archives comparing Page-level and Article-level Granularity</title>
		<author>
			<persName><forename type="first">F</forename><surname>Buhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Neumann</surname></persName>
		</author>
		<idno>FBI-HH-M-337/06</idno>
		<imprint>
			<date type="published" when="2006">2006</date>
			<pubPlace>Hamburg</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Universität Hamburg</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">High-Level Expectations for Low-Level Image Processing</title>
		<author>
			<persName><forename type="first">L</forename><surname>Hotz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Terzić</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st Annual German Conference on Artificial Intelligence</title>
		<title level="s">Springer Lecture Notes in Computer Science</title>
		<meeting>the 31st Annual German Conference on Artificial Intelligence</meeting>
		<imprint>
			<pubPlace>Kaiserslautern</pubPlace>
			<date type="published" when="2008-09">September 2008</date>
			<biblScope unit="volume">5243</biblScope>
			<biblScope unit="page" from="87" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Computer-based Stroke Extraction in Historical Manuscripts</title>
		<author>
			<persName><forename type="first">R</forename><surname>Herzog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Solth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Manuscript Cultures</title>
		<imprint>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="14" to="24" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>Newsletter</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Breakthrough Analysis: Two + Nine Types of Semantic Search</title>
		<author>
			<persName><forename type="first">S</forename><surname>Grimes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">InformationWeek</title>
		<imprint>
			<biblScope unit="page" from="1" to="21" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A survey and classification of semantic search approaches</title>
		<author>
			<persName><forename type="first">C</forename><surname>Mangold</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Metadata, Semantics and Ontology</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="23" to="34" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Survey of Semantic Search Research</title>
		<author>
			<persName><forename type="first">E</forename><surname>Mäkelä</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seminar on Knowledge Management on the Semantic Web</title>
				<meeting>the Seminar on Knowledge Management on the Semantic Web</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
		<respStmt>
			<orgName>Department of Computer Science, Univ. Helsinki</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An approach to the management of multiple aligned multilingual ontologies for a geospatial earth observation system</title>
		<author>
			<persName><forename type="first">K</forename><surname>Stock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 4th Int. Conf. on GeoSpatial Semantics</title>
				<meeting>4th Int. Conf. on GeoSpatial Semantics</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="52" to="69" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Methodology Based Survey on Ontology Management</title>
		<author>
			<persName><forename type="first">C</forename><surname>Rameshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gnanasekaran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Science &amp; Engineering Survey (IJCSES)</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Ontology Alignment - A Survey with Focus on Visually Supported Semi-Automatic Techniques</title>
		<author>
			<persName><forename type="first">M</forename><surname>Granitzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sabol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Onn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lukose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Tochtermann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Internet</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="238" to="258" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A case-based approach to intelligent information retrieval</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Daniels</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">L</forename><surname>Rissland</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1995">1995</date>
			<biblScope unit="page" from="238" to="245" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Information Retrieval from Documents: A Survey</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chaudhuri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="141" to="163" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Retrieval, reuse, revision, and retention in case-based reasoning</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>De Mantaras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>McSherry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Leake</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Craw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Faltings</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Maher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Cox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Forbus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Keane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aamodt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Watson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Knowledge Engineering Review</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="215" to="240" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">An ontology-based personalized retrieval model using case base reasoning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zidi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bouhana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mourad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fekih</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 18th Int. Conf. on Knowledge-Based and Intelligent Information &amp; Engineering Systems</title>
				<meeting>18th Int. Conf. on Knowledge-Based and Intelligent Information &amp; Engineering Systems</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="212" to="222" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Apertium: a free/open-source platform for rule-based machine translation</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Forcada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ginestí-Rosell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nordfalk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>O'Regan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ortiz-Rojas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Pérez-Ortiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sánchez-Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ramírez-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Tyers</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Translation</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="127" to="144" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Moses: Open Source Toolkit for Statistical Machine Translation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hoang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Birch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Federico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bertoldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cowan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Moran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bojar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Constantin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Herbst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL 2007: proceedings of demo and poster sessions</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="177" to="180" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Automatic Morphologic Classification System for Romanian</title>
		<author>
			<persName><forename type="first">R</forename><surname>Simionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cristea</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">BringITon! 2012 Catalog</title>
				<editor>et al., L.</editor>
		<meeting><address><addrLine>Iasi, Romania</addrLine></address></meeting>
		<imprint>
			<publisher>Editura University Al. I. Cuza</publisher>
			<date type="published" when="2012-05">May 2012</date>
			<biblScope unit="page" from="52" to="53" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Reconstructing the Diachronic Morphology of Romanian from Dictionary Citations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cristea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Simionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Haja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of LREC-2012</title>
				<meeting>LREC-2012<address><addrLine>Istanbul, Turkey</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012-05">May 2012</date>
			<biblScope unit="page" from="923" to="927" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Introduction to the Special Issue on Summarization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Radev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>McKeown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="399" to="408" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Summarizing Short Texts Through a Discourse-Centered Approach in a Multilingual Context</title>
		<author>
			<persName><forename type="first">D</forename><surname>Anechitei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cristea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Dimosthenis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ignat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karagiozov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kopeć</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Vertan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Where Humans Meet Machines: Innovative Solutions to Knotty Natural Language Problems</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Neustein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Markowitz</surname></persName>
		</editor>
		<meeting><address><addrLine>Heidelberg/New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2013-05">May 2013</date>
			<biblScope unit="page" from="109" to="136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Veins Theory. A Model of Global Discourse Cohesion and Coherence</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cristea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Romary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 17th International Conference on Computational Linguistics -Coling &apos;98, and the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics -ACL &apos;98</title>
				<meeting>17th International Conference on Computational Linguistics -Coling &apos;98, and the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics -ACL &apos;98<address><addrLine>Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998-08">August 1998</date>
			<biblScope unit="page" from="281" to="285" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Veins theory revisited</title>
		<author>
			<persName><forename type="first">E</forename><surname>Mitocariu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<pubPlace>Iasi, Romania</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University Al. I. Cuza</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Dissertation</note>
	<note>in preparation</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
