<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DAICA - Digital Assistant Investigating Cultural Assets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lothar Hotz</string-name>
          <email>hotz@informatik.uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Cristea</string-name>
          <email>dcristea@info.uaic.ro</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justyna Pietrzak</string-name>
          <email>justyna@eleka.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Povazay</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brigitte Rauter</string-name>
          <email>brigitte.rauterg@psolutions.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Buleandra</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eleka Ingeniaritza Linguistikoa S.L.</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HITeC e.V. c/o University of Hamburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>P.Solutions Informationstechnologien GmbH</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>SIVECO Romania SA</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University Alexandru Ioan Cuza and Romanian Academy</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Besides web pages, the web offers access to an immense variety of digitized source material, inventories and catalogues hosted by libraries and archives relevant for humanities and social sciences (HSS) studies. In practice, remote access to HSS information is considerably hampered by several barriers: researchers interested in a specific topic do not know which institution harbors related information; data collections are equipped with unique user interfaces and offer different data structures; language barriers impede information exploitation; retrieval mechanisms do not provide intelligent access to semantically related information. In this paper, we describe a Digital Assistant Investigating Cultural Assets (DAICA) for research and information procurement in HSS, guided by the vision of a digital information space of cultures. The DAICA will support HSS studies by autonomously identifying appropriate resources and presenting topical investigation results. In particular, the DAICA will integrate technology and provide a solution for analysing historical digitized documents, performing semantic search in deep data structures, automatic translation, extending a search by meaningful relations, creating summaries of identified resources, and providing user interactions for complex search results.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic search</kwd>
        <kwd>machine translation</kwd>
        <kwd>summarization</kwd>
        <kwd>optical character recognition</kwd>
        <kwd>cultural heritage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>information, organized on local websites and in principle accessible for inquiries
and research from anywhere in the Internet, however not by ordinary search
engines - they reside in the "deep web" and require special access methods.</p>
      <p>Resources exist in a multitude of media types: unstructured and structured
text, with and without metadata, full text accessible or not, images, and videos.
Media comprise historical or cultural texts, biographical information,
newspaper articles, maps of cities, regions, and countries, paintings, photographs, and
much more. In addition to varying data formats, the user interfaces to these
resources differ widely, both regarding language and functionality. Hence, research
is difficult, and the outcomes are all too often incomplete and insufficient.
Typically, tedious manual investigations are needed to study a specific topic: one
has to get into contact with many possible sources, get acquainted with access
modalities, retrieve data, commonly by language-specific keyword search,
determine relevance across language barriers, and summarise the resulting material.
In HSS studies, the task is further aggravated by the diversity of cultural
regions, each with its own history and tradition, and hence, often with a different
understanding of seemingly identical terms.</p>
      <p>This situation is faced by scholars, students, journalists, the public, and the media
whenever they want to investigate data sources abroad, not only data of their local
libraries. In principle the same applies to investigations by business and
commercial aggregators. There is clearly a need to pave the way for barrier-free
access to cultural data, i.e., support in accessing distributed repositories, in translation services,
in fast interpretation of new unknown resources, and in collaboration with other
interested users.</p>
      <p>Fortunately, basic technology for multilingual and semantically enhanced
search in multimedia databases is available. But in order to effectively
support HSS studies, techniques have to be adapted and integrated. For example,
relevance criteria including temporal and geographical proximity, cultural
vicinity, or taxonomical distance of terms must be taken into account for a semantic
search to be effective. Optical character recognition (OCR) must be invoked
for searching scanned documents; machine translation must be used for
identifying, linking and presenting relevant multilingual data; and summarisation,
visualization and user-device interaction facilities must be provided for complex
and multifaceted data. Altogether, these methods can allow effective retrieval of
historical and cultural data from the deep web.</p>
      <p>As a technological innovation, this paper presents the concept of an intelligent
HSS research assistant called Digital Assistant Investigating Cultural Assets
(DAICA), which will be used by scholars of the humanities and by the interested
public or industry users who need to investigate a specific topic.
DAICA will return summaries of semantically related documents, articles, and texts,
drawn not only from web pages but also from other data sources.</p>
      <p>This paper describes a use case of DAICA (Section 2) and a new search
procedure (Section 3) that integrates semantic search, machine translation, OCR,
entity discovery, and summarization. These technologies are integrated in a
configurable framework (Section 4 and Section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>A Use Case of DAICA</title>
    </sec>
    <sec id="sec-3">
      <title>A New Integrated Search Procedure</title>
      <p>A main task of DAICA is the transparent integration of semantic search, machine
translation, and all other capabilities such as summarization, OCR, and entity
discovery (see Figure 2 which illustrates the complete process). A user spells out
implicitly or explicitly the search specification (query) in his/her language, the
source language, in the example German. DAICA translates this query and its
ontological enhancements into the target language (here Romanian) of a data
source and starts the search. The results and their enhancements through links
and summarisation are again translated back into the source language. In the
following section, the details about the involved components are given.</p>
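      <p>The procedure above can be illustrated by a minimal sketch. It assumes that translation and search are supplied as plain functions and that the ontological enhancement is a simple lookup of related terms; the names <monospace>expand_query</monospace> and <monospace>integrated_search</monospace> are hypothetical and do not refer to actual DAICA components.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Dict, List

# Hypothetical signatures for illustration only:
#   translate(text, from_lang, to_lang) -> translated text
#   search(term) -> list of result titles in the data source's language
Translate = Callable[[str, str, str], str]
Search = Callable[[str], List[str]]


def expand_query(query: str, ontology: Dict[str, List[str]]) -> List[str]:
    """Return the query followed by its ontologically related terms."""
    return [query] + [t for t in ontology.get(query, []) if t != query]


def integrated_search(query: str, source_lang: str, target_lang: str,
                      ontology: Dict[str, List[str]],
                      translate: Translate, search: Search) -> List[str]:
    """Expand, translate, search, and translate the results back."""
    # 1. Enhance the query in the user's (source) language.
    terms = expand_query(query, ontology)
    # 2. Translate every term into the language of the data source.
    target_terms = [translate(t, source_lang, target_lang) for t in terms]
    # 3. Search the data source, keeping each hit once in first-seen order.
    hits: List[str] = []
    for term in target_terms:
        for hit in search(term):
            if hit not in hits:
                hits.append(hit)
    # 4. Translate the results back into the source language.
    return [translate(h, target_lang, source_lang) for h in hits]
```
]]></preformat>
      <p>Passing the translator and the search backend as parameters mirrors the idea that capabilities are selected per configuration rather than fixed in the procedure.</p>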
    </sec>
    <sec id="sec-4">
      <title>The DAICA Framework</title>
      <p>In order to obtain sustainable, reusable results the objective of DAICA is to build
a general framework which will be used for the development of complex
specialised architectures, each accommodating different capabilities, selected
during configuration sessions. These capabilities realize the basic technologies for
investigation tasks, i.e., optical character recognition (OCR) for enabling the
translation of text images into words, search which takes the meaning of queries
and documents into account, machine translation for interpreting documents in
foreign languages, summarisation for getting a quick overview of a document or
article, entity and link discovery for identifying important persons and subjects
in a text. However, it is not evident which capabilities to use at which time, or,
if capabilities come in variants, which version is best for a given investigation
task.</p>
      <p>Therefore, DAICA is defined as a framework, together with its infrastructure and a suitable
interface technology, which can be used to interactively assemble architectures
by selecting suitable components that implement the capabilities needed for a given
investigation task. Various kinds of users, "aggregators", will use this kit to
build DAICA configurations that best support their own needs or meet the
investigation requirements of others.</p>
      <p>DAICA instantiations will constitute another layer of outputs resulting
from DAICA interactions. A DAICA instantiation (or instance) represents the
combination of a DAICA configuration and a specific set of acquired
resources from different data sources (usually referring to a specific topic a certain
user has worked on). These resources will be accumulated by a user (or a
community of users) during a series of work sessions with DAICA. Hence, all interactions,
typically with specific investigation goals and directed to specific resources, are
stored, catalogued and offered for post-research re-use, for the benefit of their
creators or future users. DAICA instantiations are thus collections of resources
with respect to a topic.</p>
      <p>Examples of DAICA configurations can be:
        <list list-type="bullet">
          <list-item><p>DAICA-1: Capability to process contemporary German, Spanish, English
and Romanian, with OCR, indexing, external linking of named entities,
summarisation, and translation between these four languages;</p></list-item>
          <list-item><p>DAICA-2: Processing German and Romanian texts from 1850 to the present
date, OCR including the Gothic German and the transitional Cyrillic
alphabet used in Romania in the middle of the 19th century, indexing and
external linking of named entities and time expressions, summarisation, and translation
between these two languages.</p></list-item>
        </list>
      </p>
      <p>Examples of DAICA instantiations can be:
        <list list-type="bullet">
          <list-item><p>Based on DAICA-2: Links to the bibliographical sources in the Library of
Hamburg and the Academy Library of Bucharest, and a knowledge base with dated
entries related to migration in Germany and Romania in the 19th
century;</p></list-item>
          <list-item><p>Based on DAICA-1: Links between German academic libraries and information
from Basque archives, relating to investigations carried out by German
scholars in the Basque Country in the 19th century.</p></list-item>
        </list>
      </p>
      <p>Hence, an instantiation summarizes all information about a specific topic,
e.g., content identified in some libraries, notes made by the user. Furthermore,
if made public, other users can make use of and refine such previously created
instantiations through the DAICA instantiation retrieval mechanism. As such,
the DAICA instantiations are the base for building a community discussing and
further developing cultural topics.</p>
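      <p>The relationship between configurations and instantiations can be made concrete with a small data-structure sketch. The class and field names below, as well as the resource identifier format, are assumptions chosen for illustration; they are not a specification of the DAICA data model.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Configuration:
    """A selection of languages and capabilities (hypothetical model)."""
    name: str
    languages: Tuple[str, ...]
    capabilities: Tuple[str, ...]


@dataclass
class Instantiation:
    """A configuration plus the resources gathered for one topic."""
    topic: str
    configuration: Configuration
    resources: List[str] = field(default_factory=list)
    public: bool = False

    def add_resource(self, link: str) -> None:
        # Store each resource link only once across work sessions.
        if link not in self.resources:
            self.resources.append(link)


def find_public(instantiations: List[Instantiation],
                keyword: str) -> List[Instantiation]:
    """Return public instantiations whose topic mentions the keyword."""
    needle = keyword.lower()
    return [i for i in instantiations
            if i.public and needle in i.topic.lower()]
```
]]></preformat>
      <p>Keeping the configuration immutable while the resource list grows reflects the distinction drawn above: a configuration is assembled once, whereas an instantiation accumulates material over a series of sessions.</p>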
      <p>DAICA uses a number of existing technologies, which will be adapted
to comply with the actual requirements. The DAICA capabilities include the
following features:
        <list list-type="bullet">
          <list-item><p>Specification and customization of the investigation task (proactive and
triggered search specifications);</p></list-item>
          <list-item><p>Specification of data sources, their access, and their content in the form
of metadata schemas, languages, or ontologies and used terminology, hence
enabling access to foreign data and content without the need to manually
traverse user interfaces or interpret a library structure (by data source
profiles);</p></list-item>
          <list-item><p>Analysis of ancient digitized documents (by pattern recognition and OCR,
word spotting);</p></list-item>
          <list-item><p>Deep semantic search through data sources (by indexing and semantic search);</p></list-item>
          <list-item><p>Automatic translation of queries and resources (by machine translation);</p></list-item>
          <list-item><p>Linking of resources with expressive relationships on the basis of semantic
entities (by entity identification, reference resolution, detection of temporal
and spatial relations);</p></list-item>
          <list-item><p>Creating summaries of the identified resources (by automatic
summarisation);</p></list-item>
          <list-item><p>Friendly end-user interfaces for the visualization of complex search results
and dependencies in the Web for different types of devices: laptop, tablet,
smart phone (by innovative visualization and user-device interaction);</p></list-item>
          <list-item><p>Easy configuration of new processing architectures to support a wide range
of thematic investigations (by configuration facilities);</p></list-item>
          <list-item><p>Project profiling for storing, retrieving and sharing of resources as DAICA
instances (by instantiation facilities).</p></list-item>
        </list>
      </p>
      <p>In summary, these features will be integrated in a generally applicable and
customizable technological framework that will allow easy configuration of new
architectures in order to help researchers and other categories of users to perform
assisted cultural HSS investigations. Once installed in a DAICA platform, the
framework can be used by aggregators (libraries, research institutes,
administration) to configure new applications that will allow public users to get access to
new data or to administer previously curated DAICA instantiations.</p>
    </sec>
    <sec id="sec-5">
      <title>Technologies for DAICA</title>
      <sec id="sec-5-1">
        <title>Existing investigation tools</title>
        <p>The widely used Aleph integrated library system provides academic, research,
and national libraries with the efficient, user-friendly tools and workflow
support they need to meet the increasing requirements of the industry today and
in the future. Built on an Oracle database, Aleph runs on a range of
operating systems. Employing system-wide XML technology, Aleph offers third-party
integration through an XML gateway. The product is based on industry
standards, offering the ultimate in resource-sharing capabilities, full connectivity,
and seamless interaction with other systems and databases.</p>
        <p>Another solution used in libraries is DigiTool, which enables academic
libraries and library consortia to manage and provide access to digital resources,
both those that are created for use within the institution and those that are
collected and maintained by the library for the benefit of the public.</p>
        <p>Since many resources are publicly exposed on the Web, Web search
engines and crawlers are further investigation tools that can be used for searching.
Some open source or commercial tools (which can influence
the solution) are: Datapark, ebhath, Eureka, Indri, ISearch, IXE, Lucene,
Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server,
Namazu, Nutch, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb,
SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine,
XMLSearch, Zebra, BBDBot and Zettair.</p>
        <p>Besides search itself, one technique to be used when combining HSS data
from multiple sources is data integration. State of the art approaches for data
integration have adopted a schema-first (e.g., ETL, enterprise integration), a
schema-never (e.g., search engines), or a schema-later (e.g., dataspaces)
methodology.</p>
        <p>Such tools provide the basic search and data access interfaces to library
content. For DAICA, libraries operating those tools can and will be integrated
through data source profiles. Furthermore, the provided search facilities of the
tools will be used by the semantic search capability to perform keyword-based
search.</p>
        <p>Many cultural assets are currently published through EUROPEANA.
EUROPEANA bases its search functionalities on who, what, where, when and
corresponding restrictions for media type, language, country, and provider. DAICA
will base the search on semantic ontologies and multilingual access, thus
facilitating document access for users. However, through the envisioned data
source profiles, EUROPEANA can be integrated in the DAICA framework and
thus be part of a DAICA investigation.</p>
      </sec>
      <sec id="sec-5-2">
        <title>OCR, pattern recognition</title>
        <p>
          For identification and retrieval of digitized but not yet recognized documents,
DAICA includes OCR (Optical Character Recognition) tools. The main
challenges which have to be faced are the following:
          <list list-type="bullet">
            <list-item><p>OCR must be based on a variety of historical fonts and spellings.</p></list-item>
            <list-item><p>Document images may have poor quality and may require image
enhancement.</p></list-item>
            <list-item><p>Character and word recognition may be ambiguous due to noise.</p></list-item>
            <list-item><p>Word, sentence and semantic context must be exploited for disambiguation.</p></list-item>
          </list>
There exist several commercial and open source OCR tools, which perform
high-quality OCR (up to 99%) for standard fonts and low-noise conditions [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. On
the other hand, character and word recognition results may be quite poor
(below 80%) without prior knowledge of the font and without exploiting context
information.
        </p>
        <p>Hence, DAICA applies several innovative techniques to achieve high-quality
OCR. First, OCR tasks are supported by their semantic context using metadata
and ontologies, so that ambiguities can be significantly reduced. For example,
ambiguous readings can be refuted if the semantic distance (computed from
an ontology such as WordNet) to the investigation topic exceeds a threshold.
As a second innovative technique, applicable to manuscripts or unusual fonts,
DAICA will allow word spotting based on patterns supplied by the investigator.
This way, occurrences of similar patterns can be retrieved from a document. A
third technique, mainly applicable to handwritten documents, will be the use of
an advanced text-line finder which can cope with varying line orientations.</p>
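        <p>The first technique, refuting ambiguous readings by semantic distance, can be sketched as follows. The sketch substitutes a tiny hand-made is-a taxonomy for WordNet and uses path length through the nearest common ancestor as the distance; the taxonomy, the function names and the threshold value are all assumptions for illustration.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Optional

# Toy is-a taxonomy (child -> parent); a hypothetical stand-in for an
# ontology such as WordNet.
PARENT: Dict[str, str] = {
    "ship": "vessel",
    "boat": "vessel",
    "vessel": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
    "sheep": "animal",
    "animal": "organism",
    "organism": "entity",
}


def ancestors(term: str) -> List[str]:
    """Return the chain from the term up to the root of the taxonomy."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def distance(a: str, b: str) -> Optional[int]:
    """Path length between two terms via their nearest common ancestor."""
    depth_b = {term: i for i, term in enumerate(ancestors(b))}
    for i, term in enumerate(ancestors(a)):
        if term in depth_b:
            return i + depth_b[term]
    return None  # no common ancestor, e.g. a non-word reading


def filter_readings(candidates: List[str], topic: str,
                    threshold: int) -> List[str]:
    """Keep only the readings within the distance threshold of the topic."""
    kept = []
    for reading in candidates:
        d = distance(reading, topic)
        if d is not None and d <= threshold:
            kept.append(reading)
    return kept
```
]]></preformat>
        <p>With the topic "boat" and a threshold of 3, the competing readings "ship" and "sheep" are separated: "ship" lies two steps away and is kept, whereas "sheep" is only connected through the root and is refuted.</p>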
        <p>
          Thus, the approach for DAICA will be mainly based on existing OCR tools
of the partners and open source tools, as well as low-level and context-supported
computer vision and manuscript analysis [
          <xref ref-type="bibr" rid="ref2 ref3 ref4">2,3,4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>Semantic search</title>
        <p>A central goal of the DAICA is to provide support for studies of cultural heritage
by extending keyword-based search to a much broader search based on semantic
relations. A semantic search has the advantage of narrowing down ambiguous
word meanings, especially across language barriers, and allowing proactive
background search for related information. This goal can be achieved by a variety
of techniques which try to take the intention of the user and the meaning of a
query into account when searching in data sources.</p>
        <p>
          There exist several approaches for semantic search as documented, for
example, in the surveys [
          <xref ref-type="bibr" rid="ref5 ref6 ref7">5,6,7</xref>
          ]. DAICA lays the focus on exploiting ontological
information, which is used in two fundamental ways: (i) to define, refine and
expand the query topic; (ii) to find semantically related information in data
sources.
        </p>
        <p>Several publicly accessible implementations of semantic search approaches
exist, including QWant, GoPubMed, Swoogle, and Google's Knowledge Graph,
which deal with specific kinds of ontological representations. These techniques,
however, do not meet the requirements for the intelligent agent conceived in
this work: (i) DAICA will have to access a large number of heterogeneous
content structures used in the archiving institutions for cultural heritage or similar
data aggregations. Some may be supported by full-fledged Semantic Web
ontologies, others by customized categorization schemes. In consequence it will be
necessary to invoke ontology alignment in some form; (ii) Search will be
multilingual, crossing language barriers between the user and information sources;
(iii) In DAICA, the user can define a query by several kinds of topic descriptions,
ranging from keywords, annotated images, and graphic patterns to coherent texts.
Hence several heterogeneous measures of semantic distance will play a part,
for example taxonomical distance, relatedness by names, time or geographical
location, or chains of ontological structures; (iv) The user will be supported by
proactive search, i.e., by autonomous background explorations through entity
and link discovery in the text the user is writing; (v) Access to DAICA will be possible
via mobile devices, and rendition of results will include summarisation.</p>
        <p>The software for individual techniques is mostly available either as open
source or held by the authors. The main task for the DAICA is to conceive
and integrate a tool combining the techniques in a user-friendly way.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Ontology management</title>
        <p>In our approach, ontologies play an important role for obtaining meaningful
search results in support of a user's investigation. All essential DAICA
functionalities resort to ontologies, in particular semantic search, language translation,
interpreted OCR, entity discovery, topical linking, and summarisation.
Ontologies may provide concept names and definitions in terms of relations to other
concepts, for example generalization, specialization, synonyms and antonyms.
Standardized properties relate entities to important search criteria, such as
location and time.</p>
        <p>
          Due to the highly heterogeneous data sources of the cultural heritage and
diverse evolved standards, investigations with DAICA have to cope with
multiple ontologies in different languages, ranging from carefully designed OWL
ontologies to simple databases characterized by metadata schemes. In order to
determine the relevance of resources for a user query, DAICA must be able
to align these ontologies with the semantics of the query. Several methods for
query answering based on multiple and multilingual ontologies have been
developed in the past decade, see [
          <xref ref-type="bibr" rid="ref10 ref8 ref9">8,9,10</xref>
          ] for surveys. Typically, there is a matching
(or alignment) step where correspondences between heterogeneous ontologies are
determined, and an interpretation step, where information relevant for a query
is extracted.
        </p>
        <p>In the DAICA infrastructure, ontology matching and interpretation will be
performed for ontologies based on standards such as Dublin Core or Schema.org,
on controlled vocabularies (WordNet and thesaurus vocabularies), and on
existing biographical data standards and classifications. In several countries data
sources are described by authority files of standardized metadata, for example in Germany:
Gemeinsame Normdatei, beacon files, gazetteer data and links, files for common
public corporation data (GKD, Gemeinsame Körperschaftsdatei) with company
and institution names, registers with personal names (PND), and the common norm
data file (SWD, Schlagwortnormdatei) with commonly used tag words,
categories, and subject headings. Mass data with named entity tagging and
recognition data will enhance the scope of results and open up semantic relations and
links to more resources.</p>
      </sec>
      <sec id="sec-5-5">
        <title>DAICA instantiation retrieval</title>
        <p>A DAICA instantiation represents a DAICA configuration and the resources
acquired by a user or a community of users having close scientific interests for a
specific topic using this configuration. As such, a DAICA instantiation represents
a complete investigation case which is both a useful documentation for the
investigator and a valuable resource for similar investigations of other users. It
is the objective of DAICA to support all users of the DAICA community by a
case base of instantiations and case-based retrieval mechanisms.</p>
        <p>
          Case-based information retrieval is a well-established technology, see [
          <xref ref-type="bibr" rid="ref11 ref12">11,12</xref>
          ]
for surveys. While case-based retrieval was originally conceived for
feature-based object representations, applications to relational structures have proved
quite successful [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. More recently, case-based retrieval was further enhanced
by ontology-based representations and corresponding similarity measures [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
During the development of DAICA, special theoretical attention will be given
to an ontological organisation of the collections of DAICA instantiations. For
example, issues of interest here are demarcation strategies (when should two
instances be considered identical or distinct?) and inheritance (does an
instantiation A inherit parts of descriptions, sources, links, etc. from an
instance B?).
        </p>
      </sec>
      <sec id="sec-5-6">
        <title>Machine translation, multilingual processing in combination with semantic search and summarisation</title>
        <p>
          Developing efficient machine translation is a long-lasting and multi-level
process. DAICA uses a mixture of the mature technologies of statistical,
example-based, and rule-based machine translation (SMT and RBMT). As basic
features, DAICA includes:
          <list list-type="bullet">
            <list-item><p>Resource collection (semi-automatic parallel corpora extraction and
dictionary building, with special emphasis on lesser-resourced languages and
in-domain registers). These data will be used for training and tuning machine
translation modules.</p></list-item>
            <list-item><p>Development of the query translation module. Previous experience and
expertise of the partners will be used for adapting existing methods to the language
pairs of DAICA. SMT is language-independent, and the same toolkit can be
used for any pair of languages provided specific single-language texts and
parallel texts for all language translation pairs exist. But the state of
development of rule-based machine translation (RBMT) varies, depending on
the language pair.</p></list-item>
            <list-item><p>DAICA will use the Apertium software [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Apertium is a classical
shallow-transfer or transformer system, released under a GNU licence. Apertium
includes dictionaries for language pairs involving Spanish. The Apertium MT
engine consists of pipelined modules for morphological analysis,
part-of-speech tagging, and text generation, as well as statistical machine translation
(SMT) based on the Moses toolkit [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].</p></list-item>
          </list>
        </p>
        <p>Hence, our summariser is multilingual at the architectural level, meaning that
it incorporates a pipeline of modules which has the same structure irrespective of
the language of the processed document. However, initial elements of this chain
(among which, the tokeniser, the POS-tagger, the lemmatiser, the NP-chunker,
the clause splitter, the named entity recogniser, and the anaphora resolver) are
strongly language dependent.</p>
        <p>In the former ATLAS project<sup>6</sup>, summarisers for Bulgarian, German, Greek,
Polish and Romanian have been built, meaning that our general summarisation
architecture has been adapted for all these languages by assembling basic-level
NLP modules supplied by partners. For DAICA, we will build summarisers for
German, English, and Romanian, by re-using (and, where necessary, also
enhancing) the German and Romanian basic components and including open-source
modules for English.</p>
        <p>
          The situation is somewhat different when the language of the documents is
old. Romanian, for instance, has changed dramatically over time. Not only have the
lexicon, grammar and syntax evolved, but the alphabet has also changed from
Old Cyrillic to Latin, with a mixture of the two, called the Cyrillic Transition
Alphabet, used for a period in the middle of the 19th century. Based on
previous work [
          <xref ref-type="bibr" rid="ref17 ref18">17,18</xref>
          ] we will study diachronic Romanian morphology.
        </p>
      </sec>
      <sec id="sec-5-7">
        <title>Entity and link discovery</title>
        <p>Recognition of entity mentions in texts (names of people, moments of time,
countries, locations, events, organizations) and their correct interpretation in
context is an issue of primary importance in DAICA. These mentions should
open access gates to entries in the collection of accessible resources. Examples
of points of interest are:
          <list list-type="bullet">
            <list-item><p>Identify entity mentions in metadata field values and full texts and, if
necessary, interpret them ontologically, e.g., identify temporal entities and
historical dates and events;</p></list-item>
            <list-item><p>Identify relevant relations between entities such as relations between
instances: &lt;person&gt; is-in &lt;place&gt; (at &lt;time&gt;), &lt;country&gt; invades &lt;country&gt;,
&lt;person&gt; signs &lt;treaty&gt;, etc.;</p></list-item>
            <list-item><p>Identify collections of documents having contingent content such as &lt;document&gt;
is-primary-source-for &lt;event&gt;, &lt;document&gt; in-relation-with &lt;event&gt;, &lt;document&gt;
mentions &lt;person&gt; (at &lt;time&gt;).</p></list-item>
          </list>
        </p>
        <p><sup>6</sup> www.atlasproject.eu</p>
        <p>One approach for entity discovery is the use of large repositories of entity
names (gazetteers), such as person names, topics, locations, or temporal
mentions. Ontologies and terminological databases can equally be used. This
approach might look like brute force; however, because authority
files and terminologies exist in library research, a huge number of such entity
stores and ontologies can be used, similar to those mentioned in Section
5.4. Furthermore, we have recently built a large collection of regular expressions
for the identification of geographical locations in free texts. As a further means,
larger contexts and syntactic analysis can be used to identify relations between
entities. Once such relations are detected, the documents containing them can
be tagged and indexed accordingly, providing information that can be used for
intelligent retrieval.</p>
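        <p>A minimal sketch of how a gazetteer and regular expressions can be combined for location discovery is given below. The three gazetteer entries and the single cue pattern are hypothetical examples; they are not taken from the collection of regular expressions mentioned above, which is not reproduced in this paper.</p>
        <preformat><![CDATA[
```python
import re
from typing import Dict, List

# Hypothetical mini-gazetteer; a real one would be loaded from authority files.
GAZETTEER = ("Hamburg", "Bucharest", "Basque Country")

# Exact matches of known place names.
PLACE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in GAZETTEER) + r")\b"
)

# Illustrative cue pattern: a locative preposition followed by capitalised
# words. It over-generates (e.g. "in January"), so hits need later checking.
CUE_PATTERN = re.compile(r"\b(?:in|near|from|at) ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")


def find_locations(text: str) -> List[str]:
    """Return candidate location names in order of first appearance."""
    first_seen: Dict[str, int] = {}
    for match in PLACE_PATTERN.finditer(text):
        first_seen.setdefault(match.group(0), match.start())
    for match in CUE_PATTERN.finditer(text):
        first_seen.setdefault(match.group(1), match.start(1))
    return sorted(first_seen, key=first_seen.get)
```
]]></preformat>
        <p>The gazetteer contributes precision for known names, while the cue pattern proposes unknown names such as places missing from the authority files; candidates from the latter would still have to be validated, for instance against an ontology.</p>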
      </sec>
      <sec id="sec-5-8">
        <title>Summarisation</title>
        <p>
          Nowadays the quantity and diversity of data on the internet on whatever subject
is extremely vast, which makes it more and more difficult to enquire on specific
subjects. The needed information is usually hidden in an ocean of garbage data.
This is one aspect of the well-known problem of information overload. One way
to deal with it is to use summarisation techniques. In [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] a summary of a text is
defined as a piece of text that conveys important information of the original one
and that is not longer than half of the original length, usually significantly less
than that. Summarisation is a hard problem in Natural Language Processing
because, in order to do it properly, one has to really understand the point of a text.
This requires semantic analysis, linking of mentioned entities (usually referred
to as anaphora resolution), discourse processing, and inferential interpretation.
        </p>
        <p>Text summarisation methods can be classified into extractive and abstractive.
An extractive summary includes sequences of words taken from the original
document, which could be clauses, sentences or paragraphs. An abstractive summary
does not reproduce sequences from the original document, but rather includes
paraphrases of sections that mention important facts, events, and entities.</p>
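        <p>To make the extractive case concrete, the sketch below selects whole sentences from the original text and keeps the result within half of the original length, in line with the definition of a summary given above. It scores sentences by plain word frequency, a simple surface method chosen only for illustration; it is not the discourse-based approach planned for DAICA. The stop-word list, the naive sentence splitter, and all function names are assumptions.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a linguistic resource).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "it", "on", "was", "are"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarise(text, max_ratio=0.5):
    """Pick high-scoring sentences while staying within max_ratio of the text length."""
    sentences = split_sentences(text)
    frequencies = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        # Average corpus frequency of the sentence's content words.
        words = content_words(sentence)
        return sum(frequencies[w] for w in words) / len(words) if words else 0.0

    budget = max_ratio * len(text)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        cost = len(sentences[i]) + 1  # +1 for the joining space
        if used + cost <= budget:
            chosen.append(i)
            used += cost
    # Restore document order so the extract reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))


# Sample text invented for this illustration.
text = (
    "The library holds old newspapers. The newspapers report on the treaty. "
    "Rain fell that day. The treaty was signed in the library."
)
summary = summarise(text)
print(summary)
```
]]></preformat>
        <p>Because only complete sentences from the input are emitted, the output is extractive by construction. An abstractive system would instead have to generate paraphrases.</p>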
        <p>Many systems perform automatic text summarisation,
applying different techniques. Some use surface methods (involving no linguistic
analysis but exploiting instead the format of the document); some take named
entities from the original text as pivot elements and assume that the text
surrounding them is important and should stay in the summary (involving some
kind of lexical analysis and classification methods); some relate significant
features of the text to features of the summary and try to learn how to produce
summaries from human-produced ones (involving learning and statistical methods);
and some are based on discovering the discourse structure (involving
processing at the linguistic, syntactic and discourse levels). Summarisation systems can
also be classified by the number of texts processed,
as single-document or multi-document; by the languages processed, as monolingual or
multilingual; and by the genre of the processed texts.</p>
        <p>
          The summarisation approach in DAICA is an extractive single-document
process producing general or focussed summaries. It will enhance the approach
described in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], which is currently considered one of the leading approaches
in state-of-the-art automatic summarisation. It involves a long processing chain
(a tokeniser, a part-of-speech tagger, a noun phrase chunking module,
a named entity recogniser, an anaphora resolution module, a clause splitter and
a discourse parser). The improvements that we plan to realise in DAICA on
the multilingual summarisation model, initially built in the ATLAS project,
concern a number of directions, including (i) the anaphora resolution engine,
by adding rules that allow coreference resolution on finer criteria;
(ii) the clause splitter module, by implementing new machine learning algorithms
and integrating them into the calibration system; (iii) the discourse parser,
by integrating the newly acquired enhancements of the Veins Theory [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ],
focussed towards reducing the search space in an incremental parsing process [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ],
or (iv) by using the recently proposed metrics for comparing tree structures
[Mitocariu et al., 2013].
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Summary</title>
      <p>In this paper, we presented a concept for a digital assistant which integrates
and combines semantic technologies such as interpreted OCR, semantic search,
summarisation, entity and link discovery, machine translation, and case-based
retrieval to support users in investigation and research tasks. The assistant
focuses mainly on cultural heritage data sources such
as libraries. However, the underlying technologies allow the application of the
DAICA concept to arbitrary Internet sources such as web pages or social
media data. This paper represents a preliminary step of refining the conceptual
and design principles before starting the actual development process of a new
technology. However, the basic technologies which will be used for a complete
DAICA system have already been applied by us in similar approaches. We believe that
present-day technologies from the domain of Artificial Intelligence
that have attained theoretical and practical maturity can be combined in
DAICA in a very creative way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Optical Character Recognition Techniques: A Survey</article-title>
          .
          <source>Journal of Emerging Trends in Computing and Information Sciences</source>
          <volume>4</volume>
          (
          <issue>6</issue>
          ) (
          <year>June 2013</year>
          )
          <volume>545</volume>
          –
          <fpage>550</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Buhr</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Evaluation of Retrieval Performance in Historical Newspaper Archives comparing Page-level and Article-level Granularity</article-title>
          .
          <source>Technical Report FBI-HH-M-337/06</source>
          , Universität Hamburg, Hamburg
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hotz</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Terzic</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>High-Level Expectations for Low-Level Image Processing</article-title>
          .
          <source>In: Proceedings of the 31st Annual German Conference on Artificial Intelligence</source>
          . Volume
          <volume>5243</volume>
          of Springer Lecture Notes in Computer Science, Kaiserslautern
          (
          <year>September 2008</year>
          )
          <volume>87</volume>
          –
          <fpage>94</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Herzog</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Computer-based Stroke Extraction in Historical Manuscripts</article-title>
          . Manuscript Cultures,
          <source>Newsletter (3)</source>
          (
          <year>2011</year>
          )
          <volume>14</volume>
          –
          <fpage>24</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Grimes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Breakthrough Analysis: Two + Nine Types of Semantic Search</article-title>
          .
          <source>InformationWeek</source>
          <year>2010</year>
          (
          <year>2010</year>
          )
          <volume>1</volume>
          –
          <fpage>21</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mangold</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A survey and classification of semantic search approaches</article-title>
          .
          <source>International Journal of Metadata, Semantics and Ontology</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
          <volume>23</volume>
          –
          <fpage>34</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Makela, E.:
          <article-title>Survey of Semantic Search Research</article-title>
          .
          <source>In: Proceedings of the Seminar on Knowledge Management on the Semantic Web</source>
          , Department of Computer Science, Univ. Helsinki (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stock</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>An approach to the management of multiple aligned multilingual ontologies for a geospatial earth observation system</article-title>
          .
          <source>In: Proc. 4th Int. Conf. on GeoSpatial Semantics</source>
          , Springer (
          <year>2011</year>
          )
          <volume>52</volume>
          –
          <fpage>69</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rameshi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gnanasekaran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Methodology Based Survey on Ontology Management</article-title>
          .
          <source>International Journal of Computer Science &amp; Engineering Survey (IJCSES) 1</source>
          (
          <issue>1</issue>
          ) (
          <year>2010</year>
          )
          <volume>1</volume>
          –
          <fpage>12</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sabol</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Onn</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukose</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tochtermann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Ontology Alignment - A Survey with Focus on Visually Supported Semi-Automatic Techniques</article-title>
          .
          <source>Future Internet</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <volume>238</volume>
          –
          <fpage>258</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Daniels</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rissland</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          :
          <article-title>A case-based approach to intelligent information retrieval</article-title>
          .
          <source>In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM (
          <year>1995</year>
          )
          <volume>238</volume>
          –
          <fpage>245</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Information Retrieval from Documents: A Survey</article-title>
          .
          <source>Information Retrieval</source>
          <volume>2</volume>
          (
          <year>2000</year>
          )
          <volume>141</volume>
          –
          <fpage>163</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>de Mántaras</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McSherry</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bridge</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leake</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craw</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faltings</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maher</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forbus</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keane</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aamodt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Retrieval, reuse, revision, and retention in case-based reasoning</article-title>
          .
          <source>The Knowledge Engineering Review</source>
          <volume>20</volume>
          (
          <issue>3</issue>
          ) (
          <year>2005</year>
          )
          <volume>215</volume>
          –
          <fpage>240</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zidi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouhana</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mourad</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fekih</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An ontology-based personalized retrieval model using case base reasoning</article-title>
          .
          <source>In: Proc. 18th Int. Conf. on Knowledge-Based and Intelligent Information &amp; Engineering Systems</source>
          . (
          <year>2014</year>
          )
          <volume>212</volume>
          –
          <fpage>222</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Forcada</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginestí-Rosell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nordfalk</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Regan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortiz-Rojas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pérez-Ortiz</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez-Martínez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramírez-Sánchez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tyers</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          :
          <article-title>Apertium: a free/open-source platform for rule-based machine translation</article-title>
          .
          <source>Machine Translation</source>
          <volume>25</volume>
          (
          <issue>2</issue>
          ) (
          <year>2011</year>
          )
          <volume>127</volume>
          –
          <fpage>144</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Federico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertoldi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moran</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herbst</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Moses: Open Source Toolkit for Statistical Machine Translation</article-title>
          .
          <source>In: ACL 2007: Proceedings of Demo and Poster Sessions</source>
          . (
          <year>2007</year>
          )
          <volume>177</volume>
          –
          <fpage>180</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Simionescu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cristea</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Automatic Morphologic Classification System for Romanian</article-title>
          . In et al., L., ed.:
          <source>BringITon! 2012 Catalog. Editura University Al. I. Cuza</source>
          , Iasi, Romania (May
          <year>2012</year>
          )
          <volume>52</volume>
          –
          <fpage>53</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Cristea</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simionescu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haja</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Reconstructing the Diachronic Morphology of Romanian from Dictionary Citations</article-title>
          .
          <source>In: Proceedings of LREC-2012</source>
          , Istanbul, Turkey (May
          <year>2012</year>
          )
          <volume>923</volume>
          –
          <fpage>927</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McKeown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Introduction to the Special Issue on Summarization</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>28</volume>
          (
          <issue>4</issue>
          ) (
          <year>2002</year>
          )
          <volume>399</volume>
          –
          <fpage>408</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Anechitei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cristea</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimosthenis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignat</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karagiozov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koeva</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopeć</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vertan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Summarizing Short Texts Through a Discourse-Centered Approach in a Multilingual Context</article-title>
          . In:
          <string-name>
            <surname>Neustein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markowitz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , eds.:
          <source>Where Humans Meet Machines: Innovative Solutions to Knotty Natural Language Problems</source>
          . Springer Verlag, Heidelberg/New York (May
          <year>2013</year>
          )
          <volume>109</volume>
          –
          <fpage>136</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Cristea</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ide</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Veins Theory: A Model of Global Discourse Cohesion and Coherence</article-title>
          .
          <source>In: Proceedings of 17th International Conference on Computational Linguistics - Coling '98, and the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - ACL '98</source>
          , Montreal, Canada (
          <year>August 1998</year>
          )
          <volume>281</volume>
          –
          <fpage>285</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mitocariu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Veins theory revisited</article-title>
          .
          <source>Dissertation</source>
          ,
          University Al. I. Cuza
          , Iasi, Romania (2015 - in preparation)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>