An Open Science System for Text Mining. CEUR-WS Vol-2481, paper 23: https://ceur-ws.org/Vol-2481/paper23.pdf
                            An Open Science System for Text Mining

            Gianpaolo Coro                      Giancarlo Panichi                     Pasquale Pagano
               ISTI-CNR                            ISTI-CNR                              ISTI-CNR
        via Moruzzi 1 Pisa, Italy            via Moruzzi 1 Pisa, Italy             via Moruzzi 1 Pisa, Italy
        coro@isti.cnr.it                    panichi@isti.cnr.it                   pagano@isti.cnr.it



                         Abstract

    Text mining (TM) techniques can extract high-quality information from big data through complex system architectures. However, these techniques are usually difficult to discover, install, and combine. Further, modern approaches to Science (e.g. Open Science) introduce new requirements to guarantee reproducibility, repeatability, and re-usability of methods and results, as well as their longevity and sustainability. In this paper, we present a distributed system (NLPHub) that publishes and combines several state-of-the-art text mining services for named entities, events, and keywords recognition. NLPHub makes the integrated methods compliant with Open Science requirements and manages heterogeneous access policies to the methods. In the paper, we assess the benefits and the performance of NLPHub on the I-CAB corpus.¹

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Today, text mining operates within the challenges introduced by big data and new Science paradigms, which require managing large volumes, high production rates, heterogeneous complexity, and unreliable content, while ensuring data and methods longevity through re-use in complex models and process chains. Among the new paradigms, Open Science (OS) focusses on the implementation in computer systems of the three "R"s of the scientific method: Reproducibility, Repeatability, and Re-usability (Hey et al., 2009; EU Commission, 2016). The systems envisaged by OS are based on Web service networks that support big data processing and the open publication of results. Although text mining techniques exist that can tackle big data experiments (Gandomi and Haider, 2015; Amado et al., 2018), few examples that incorporate OS concepts can be found (Linthicum, 2017). For example, common text mining "cloud" services do not allow easy repeatability of the experiments by different users and are usually domain-specific and thus poorly re-usable (Bontcheva and Derczynski, 2016; Adedugbe et al., 2018). Available multi-domain systems do not use communication standards (Bontcheva and Derczynski, 2016; Wei et al., 2016), and the few OS-oriented initiatives that use text mining focus specifically on document preservation and cataloguing (OpenMinTeD, 2019; OpenAire, 2019).

In this paper, we present a multi-domain text mining system (NLPHub) that is compliant with OS and combines multiple and heterogeneous processes. NLPHub is based on an e-Infrastructure (e-I), i.e. a network of hardware and software resources that allows remote users and services to collaborate while supporting data-intensive Science through cloud computing (Pollock and Williams, 2010; Andronico et al., 2011). Currently, NLPHub integrates 30 state-of-the-art text mining services and methods to recognize fragments of a text (annotations) associated with named abstract or physical objects (named entities), spatiotemporal events, and keywords. These integrated processes cover overall 5 languages (English, Italian, German, French, and Spanish), requested by the European projects this software is involved in (i.e. Parthenos, 2019; SoBigData, 2019; Ariadne, 2019). These processes come from different providers that have different access policies, and the e-I is used both to manage this heterogeneity and to possibly speed up the processing through cloud computing. NLPHub uses the Web Processing Service standard (WPS; Schut and Whiteside, 2007) to describe all integrated processes, and the Prov-O XML
ontological standard (Lebo et al., 2013) to track the complete set of input, output, and parameters used for the computations (provenance). Overall, these features enable OS-compliance, and we show that the orchestration mechanism implemented by NLPHub adds effectiveness and efficiency to the connected methods. The name "NLPHub" refers to the forthcoming extensions of this platform to other text mining methods (e.g. sentiment analysis and opinion mining) and natural language processing tasks (e.g. text-to-speech and speech processing).

2 Methods and tools

2.1 E-Infrastructure and Cloud Computing Platform

Figure 1: Overall architectural schema of the NLPHub.

NLPHub uses the open-source D4Science e-I (Candela et al., 2013; Assante et al., 2019), which currently supports applications in many domains through the integration of a distributed storage system, a cloud computing platform, online collaborative tools, and catalogues of metadata and geospatial data. D4Science supports the creation of Virtual Research Environments (VREs) (Assante et al., 2016), i.e. Web-based environments fostering collaboration and data sharing between users and managing heterogeneous data and service access policies. D4Science grants each user access to a private online file system (the Workspace) that uses a high-availability distributed storage system behind the scenes, and enables folder creation and sharing between VRE users. Through VREs and accounting and security services, D4Science is able to manage heterogeneous access policies by granting free access to open services in public VREs, and controlled/private access to non-open services in private or moderated VREs. D4Science includes a cloud computing platform named DataMiner (Coro et al., 2015; Coro et al., 2017) that currently hosts ∼400 processes and makes all integrated processes available under the WPS standard (Figure 1). WPS is supported by third-party software and allows standardising a process' input, its parameterisation, and its output. DataMiner executes the processes in a cloud computing cluster of 15 machines with the Ubuntu 16.04.4 LTS x86_64 operating system, 16 virtual cores, 32 GB of RAM, and 100 GB of disk space. These machines are hosted by the National Research Council of Italy and the Italian Network of the University and Research (GARR). Each process can parallelise an execution either across the machines (using a Map-Reduce approach) or on the cores of one single machine (Coro et al., 2017). After each computation, DataMiner saves, on the user's Workspace, all the information about the input and output data and the experiment's parameters (computational provenance) using the Prov-O XML standard. In each D4Science VRE, DataMiner offers an online tool to integrate algorithms, which supports many programming languages (Coro et al., 2016). All these features make D4Science useful to develop OS-compliant applications, because WPS and provenance tracking allow repeating and reproducing a computation executed by another user. Also, the possibility to provide a process in multiple VREs focussing on different domains fosters its re-usability (Coro et al., 2017). In this paper, we will use the term "algorithm" to indicate processes running on DataMiner, and "method" to indicate the original processes or services integrated with DataMiner.

2.2 Annotations

NLPHub integrates a number of named entity recognizers (NERs) but also information extraction processes that recognize events, keywords, tokens, and sentences. Overall, we will use the term "annotation" to indicate all the information that NLPHub can extract from a text. The complete list of supported annotations, languages, and processes is reported in the supplementary material, together with the list of all mentioned Web services' endpoints. The ontological classes used for NER annotations come from the Stanford CoreNLP software. Included non-standard annotations are "Misc" (miscellaneous concepts that cannot be associated with any of the other classes, e.g. "Bachelor of Science"), "Event" (nouns, verbs, or phrases referring to a phenomenon occurring at a certain time and/or space), and "Keyword" (a word or a phrase that is of great importance to understand the text content).

2.3 Integrated Text Mining Methods

NLPHub uses a common JSON format to represent the annotations of every integrated method. This format describes the input text, the NER processes, and the annotations for each NER:

    "text": "input text",
    "NER1": {
      "annotations": {
        "annotation1": [
          {"indices": [i1, i2]},
          {"indices": [i3, i4]},
          ...,

We integrated services and methods with DataMiner through "wrapping algorithms" that transformed the original outputs into this format. We implemented a general workflow in each algorithm to execute the corresponding integrated method, which adopts the following steps: (i) receive an input text file and a list of entities to recognize (among those supported by the language), (ii) pre-process the text by deleting useless characters, (iii) encode the text with UTF-8 encoding, (iv) send the text via HTTP-POST to the corresponding service, or execute the method on the local machine directly, if possible, and (v) return the annotation as an NLPHub-compliant JSON document. In the following, we list all the methods currently integrated with NLPHub, with reference to Figure 1 for an architectural view.

CoreNLP. The Stanford CoreNLP software (Manning et al., 2014) is an open-source text processing toolkit that supports several languages (Stanford University, 2019). NLPHub integrates CoreNLP as a service instance running within D4Science with English, German, French, and Spanish language packages enabled. Also, the Tint (The Italian NLP Tool) extension for Italian (Aprosio and Moretti, 2016) was installed as a separate service. Overall, two distinct replicated and balanced virtual machines host these services on machines with 10 GB of RAM and 6 cores.

GATE Cloud. GATE Cloud is a cloud service that offers on-payment text analysis methods as-a-service (GATE Cloud, 2019a; Tablan et al., 2011). NLPHub integrates the GATE Cloud ANNIE NER for English, German, and French within a controlled VRE that accounts for users' request load. This VRE ensures a fair usage of the services, whose access has been freely granted to D4Science in exchange for enabling OS-oriented features (SoBigData European Project, 2016).

OpenNLP. The Apache OpenNLP library is an open-source text processing toolkit mostly based on machine learning models (Kottmann et al., 2011). An OpenNLP-based English NER is available as-a-service on GATE Cloud (GATE Cloud, 2019b) and is included among the free-to-use services granted to D4Science.

ItaliaNLP. ItaliaNLP is a free-to-use service, developed by the "Istituto di Linguistica Computazionale" (ILC-CNR), hosting a NER method for Italian that combines rule-based and machine learning algorithms (ILC-CNR, 2019; Dell'Orletta et al., 2014).

NewsReader. NewsReader is an advanced event recognizer for 4 languages, developed by the NewsReader European project (Vossen et al., 2016). NewsReader is a formal inferencing system that identifies events by detecting their participants and time-space constraints. Two balanced virtual machines were installed in D4Science for the English and Italian NewsReader versions.

TagMe. TagMe is a service for identifying short phrases (anchors) in a text that can be linked to pertinent Wikipedia pages (Ferragina and Scaiella, 2010). TagMe supports 3 languages (English, Italian, and German), and D4Science already hosts its official instances. Since anchors are sequences of words having a recognized meaning within their context, NLPHub interprets them as keywords that can help contextualising and understanding the text.

Keywords NER. Keywords NER is an open-source statistical method that produces tag clouds of verbs and nouns (Coro, 2019a), which was also used by the H-Care award-winning human digital assistant (SpeechTEK 2010, 2019). Tag clouds are extracted through a statistical analysis of part-of-speech (POS) tags (extracted with TreeTagger; Schmid, 1995), and the method can be applied to all the 23 TreeTagger-supported languages. Keywords NER is executed directly on the DataMiner machines, and the noun tags are interpreted as keywords for the NLPHub scopes, because, by construction, their sequence is useful to understand the topics treated by a text.

Language Identifier. NLPHub also provides a language identification process (Coro, 2019b), should language information not be specified as input. This process was developed to be fast and easily extendible to new languages. The algorithm is based on an empirical behaviour of TreeTagger (common to many POS taggers): when TreeTagger is initialised on a certain language but processes a text written in another language, it tends to detect many more nouns and unstemmed words than verbs and other lexical categories. Thus, the detected language is the one having the most balanced ratio of recognized and stemmed words with respect to other lexical categories. This algorithm is applicable to many languages supported by TreeTagger and can run on the DataMiner machines directly. An estimated accuracy of 95% on 100 sample text files covering the 5 NLPHub languages was convincing enough to use this algorithm as an auxiliary tool for the NLPHub users.

2.4 NLPHub

On top of the methods and services described so far, we implemented an alignment-merging algorithm (AMERGE) that orchestrates the computations and assembles their outputs. AMERGE receives a user-provided input text, along with the (optional) indication of the text language, and a set of annotations to be extracted (selected among those supported for that language). Then, it concurrently invokes, via WPS, the text processing algorithms that support the input request, and eventually collects the JSON documents coming from them. Finally, it aligns and merges the information to produce one overall sequence represented in JSON format. The issue of merging the heterogeneous connected services' outputs is solved through the use of the DataMiner wrapping algorithms. Another solved issue is the merging of the different intervals identified by several algorithms focusing on the same entities. These intervals may either overlap or be mutually inclusive, and the alignment algorithm manages all cases through algebraic evaluations, as reported in the following pseudo-code:

    AMERGE Algorithm

    For each annotation E:
      Collect all annotations detected by the algorithms
        (intervals with text start and end positions);
      Sort the intervals by their start position;
      For each segment s_i:
        If s_j is properly included in s_i, process the next s_j;
        If s_i does not intersect s_j, break the loop;
        If s_i intersects s_j, create a new segment su_i as the
          union of the two segments → substitute su_i for s_i
          and restart the loop on s_j;
        Save s_i in the overall list of merged intervals S;
      Associate S to E;
    Return all (E, S) pairs sets.

Since the AMERGE algorithm is a DataMiner algorithm, it is published as-a-service with a RESTful WPS interface. It represents one single access point to the services integrated with NLPHub. In order to invoke this service, a client should specify an authorization code in the HTTP request that identifies both the invoking user and the VRE (CNR, 2016). The available annotations and methods depend on the VRE. An additional service (NLPHub-Info) allows retrieving the list of supported entities for a VRE, given a user's authorization code. NLPHub is also endowed with a free-to-use Web interface (nlp.d4science.org/hub/), based on a public VRE, operating on top of the AMERGE process, which allows interacting with the system and retrieving the annotations in a graphical format.

3 Results

We assessed the NLPHub performance by using the I-CAB corpus as a reference (Magnini et al., 2006), which contains annotations of the following named-entity categories from 527 Italian newspapers: Person, Location, Organization,
                       Person                              Geopolitical                          Location                              Organization
    Algorithm          F-meas. / Prec. / Rec. / Agreem.    F-meas. / Prec. / Rec. / Agreem.      F-meas. / Prec. / Rec. / Agreem.      F-meas. / Prec. / Rec. / Agreem.
    ItaliaNLP          79% / 74% / 84% / Excellent         77% / 74% / 80% / Good                59% / 52% / 69% / Good                58% / 52% / 66% / Good
    CoreNLP-Tint       85% / 78% / 93% / Excellent         NA / NA / NA / NA                     30% / 18% / 84% / Marginal            65% / 53% / 83% / Good
    AMERGE             84% / 74% / 96% / Excellent         77% / 74% / 80% / Good                31% / 19% / 88% / Marginal            63% / 49% / 87% / Good
    Keywords NER       20% / 12% / 56% / Marginal          14% /  8% / 66% / Marginal             6% /  3% / 58% / Marginal            22% / 13% / 66% / Marginal
    TagMe              23% / 18% / 30% / Marginal          33% / 22% / 67% / Marginal             9% /  5% / 42% / Marginal            25% / 19% / 38% / Marginal
    AMERGE-Keywords    20% / 12% / 69% / Marginal          18% / 10% / 91% / Marginal             6% /  3% / 74% / Marginal            22% / 13% / 79% / Marginal

Table 1: Performance assessment of the NLPHub algorithms with respect to the I-CAB corpus annotations.

Geopolitical entity. NLPHub was executed to annotate these same entities plus Keywords (Table 1). The involved algorithms were CoreNLP-Tint, ItaliaNLP, Keywords NER, and TagMe. According to the F-measure, CoreNLP-Tint was the best at recognizing Persons and Organizations, whereas ItaliaNLP (the only one supporting Geopolitical entities) had the highest performance on Locations and a moderately high performance on Geopolitical entities. Overall, the connected methods showed high performance on specific entities, but no single method outperformed the others on all entities. AMERGE had a lower but good F-measure and a generally high recall in all cases, which indicates that the connected algorithms include complementary and valuable intervals. The AMERGE-Keywords algorithm had a generally high recall (especially on Geopolitical entities), which means that the extracted keywords also include words from the annotated entities. The associated F-measures indicate that there is overlap with several entities. In turn, this indicates that AMERGE-Keywords could be a valuable source of information in the case of uncertainty about the entities that can be extracted from a text. As a further evaluation, we used Cohen's Kappa (Cohen, 1960) to explore the agreement between the algorithms and the I-CAB annotations. This measure required estimating the overall number of classifiable tokens; thus it is more realistic to refer to Fleiss' Kappa macro classifications rather than to the exact values (Fleiss, 1971). According to Fleiss' labels, all NERs generally have good agreement with I-CAB except for Locations, which are often reported as Geopolitical entities in I-CAB. This evaluation also highlights that AMERGE has good general agreement with the manual annotations, and thus can be a valid choice when there is no prior knowledge about the algorithm to use for extracting a certain entity.

4 Conclusions

We have described NLPHub, a distributed system connecting and combining 30 text processing methods for 5 languages that adds Open Science-oriented features to these methods. The advantages of using NLPHub are several, starting from the fact that it provides one single access endpoint to several methods and spares installation and configuration time. Further, it proposes the AMERGE process as a valid option when the best performing algorithm for a certain entity extraction is not known a priori. Also, the AMERGE-Keywords annotations can be used when the entities to extract are not known. Indeed, these features would require more investigation, especially through multiple-language experiments, in order to define their full potential and limitations. Finally, NLPHub adds to the original methods features like WPS and Web interfaces, provenance management, results sharing, and access/usage policy control, which make the methods more compliant with the Open Science requirements.

The potential users of NLPHub are scholars who want to use NERs but also want to avoid software- and hardware-related issues, or automatic agents that need to automatically extract and re-use knowledge from large quantities of text. For example, NLPHub can be used in automatic ontology population and, since it also supports Event extraction, automatic narrative generation (Petasis et al., 2011; Metilli et al., 2019). Future extensions of NLPHub will involve other text mining methods (e.g. sentiment analysis, opinion mining, and morphological parsing), and additional NLP tasks like text-to-speech and speech processing as-a-service.

Supplementary Material

Supplementary material is available on D4Science at this permanent hyper-link.
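As an illustration (not part of the original system), the interval-merging step of the AMERGE pseudo-code in Section 2.4 can be sketched in Python; the function name and the half-open (start, end) interval convention are our own assumptions:

```python
def amerge_intervals(intervals):
    """Merge (start, end) text intervals as in the AMERGE pseudo-code:
    intervals that overlap or are mutually inclusive are replaced by
    their union, while disjoint intervals are kept separate."""
    merged = []
    for start, end in sorted(intervals):
        # An interval starting inside (or at the boundary of) the last
        # merged interval overlaps or is included in it: take the union.
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]
```

For example, `amerge_intervals([(5, 9), (0, 3), (2, 6), (12, 15)])` returns `[(0, 9), (12, 15)]`: the first three intervals are chained into a single union, while the disjoint one is preserved.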
References

[Adedugbe et al.2018] Oluwasegun Adedugbe, Elhadj Benkhelifa, and Russell Campion. 2018. A cloud-driven framework for a holistic approach to semantic annotation. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 128-134. IEEE.

[Amado et al.2018] Alexandra Amado, Paulo Cortez, Paulo Rita, and Sérgio Moro. 2018. Research trends on big data in marketing: A text mining and topic modeling based literature analysis. European Research on Management and Business Economics, 24(1):1-7.

[Andronico et al.2011] Giuseppe Andronico, Valeria Ardizzone, Roberto Barbera, Bruce Becker, Riccardo Bruno, Antonio Calanducci, Diego Carvalho, Leandro Ciuffo, Marco Fargetta, Emidio Giorgio, et al. 2011. e-Infrastructures for e-science: a global view. Journal of Grid Computing, 9(2):155-184.

[Aprosio and Moretti2016] Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. arXiv preprint arXiv:1609.06204.

[Ariadne2019] Ariadne. 2019. The AriadnePlus European Project. https://ariadne-infrastructure.eu/.

[Assante et al.2016] Massimiliano Assante, Leonardo Candela, Donatella Castelli, Gianpaolo Coro, Lucio Lelii, and Pasquale Pagano. 2016. Virtual research environments as-a-service by gCube. PeerJ Preprints, 4:e2511v1.

[Assante et al.2019] Massimiliano Assante, Leonardo Candela, Donatella Castelli, Roberto Cirillo, Gian-

[Coro et al.2015] Gianpaolo Coro, Leonardo Candela, Pasquale Pagano, Angela Italiano, and Loredana Liccardo. 2015. Parallelizing the execution of native data mining algorithms for computational biology. Concurrency and Computation: Practice and Experience, 27(17):4630-4644.

[Coro et al.2016] Gianpaolo Coro, Giancarlo Panichi, and Pasquale Pagano. 2016. A web application to publish R scripts as-a-service on a cloud computing platform. Bollettino di Geofisica Teorica ed Applicata, 57:51-53.

[Coro et al.2017] Gianpaolo Coro, Giancarlo Panichi, Paolo Scarponi, and Pasquale Pagano. 2017. Cloud computing in a distributed e-infrastructure using the web processing service standard. Concurrency and Computation: Practice and Experience, 29(18):e4219.

[Coro2019a] Gianpaolo Coro. 2019a. The Keywords Tag Cloud Algorithm. https://svn.research-infrastructures.eu/public/d4science/gcube/trunk/data-analysis/LatentSemanticAnalysis/.

[Coro2019b] Gianpaolo Coro. 2019b. The Language Identifier Algorithm. hyper-link.

[Dell'Orletta et al.2014] Felice Dell'Orletta, Giulia Venturi, Andrea Cimino, and Simonetta Montemagni. 2014. T2K^2: a system for automatically extracting and organizing knowledge from texts. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

[EU Commission2016] EU Commission. 2016.
   paolo Coro, Luca Frosini, Lucio Lelii, Francesco            Open science (open access).     https:
   Mangiacrapa, Valentina Marioli, Pasquale Pagano,            //ec.europa.eu/programmes/
   et al. 2019. The gcube system: Delivering virtual           horizon2020/en/h2020-section/
   research environments as-a-service. Future Genera-          open-science-open-access.
   tion Computer Systems, 95:445–453.
                                                            [Ferragina and Scaiella2010] Paolo Ferragina and Ugo
[Bontcheva and Derczynski2016] Kalina     Bontcheva
                                                                Scaiella. 2010. Tagme: on-the-fly annotation of
   and Leon Derczynski. 2016. Extracting information
                                                                short text fragments (by wikipedia entities). In Pro-
   from social media with gate. In Working with Text,
                                                                ceedings of the 19th ACM international conference
   pages 133–158. Elsevier.
                                                                on Information and knowledge management, pages
[Candela et al.2013] Leonardo Candela, Donatella                1625–1628. ACM.
   Castelli, Gianpaolo Coro, Pasquale Pagano, and
   Fabio Sinibaldi. 2013. Species distribution model-       [Fleiss1971] Joseph L Fleiss. 1971. Measuring nomi-
   ing in the cloud. Concurrency and Computation:               nal scale agreement among many raters. Psycholog-
   Practice and Experience.                                     ical bulletin, 76(5):378.

[CNR2016] CNR. 2016. gcube wps thin clients.                [Gandomi and Haider2015] Amir Gandomi and Mur-
   https://wiki.gcube-system.org/                              taza Haider. 2015. Beyond the hype: Big data con-
   gcube/How_to_Interact_with_the_                             cepts, methods, and analytics. International Journal
   DataMiner_by_client.                                        of Information Management, 35(2):137–144.

[Cohen1960] Jacob Cohen. 1960. A coefficient of             [GATE Cloud2019a] GATE Cloud. 2019a. GATE
   agreement for nominal scales. Educational and psy-          Cloud: Text Analytics in the Cloud. https://
   chological measurement, 20(1):37–46.                        cloud.gate.ac.uk/.
[GATE Cloud2019b] GATE Cloud. 2019b. OpenNLP                    Knowledge-driven multimedia information ex-
   English Pipeline. https://cloud.gate.                        traction and ontology evolution, pages 134–166.
   ac.uk/shopfront/displayItem/                                 Springer-Verlag.
   opennlp-english-pipeline.
                                                             [Pollock and Williams2010] Neil Pollock and Robin
[Hey et al.2009] Tony Hey, Stewart Tansley, Kristin M            Williams. 2010. E-infrastructures: How do we
   Tolle, et al. 2009. The fourth paradigm: data-                know and understand them? strategic ethnography
   intensive scientific discovery, volume 1. Microsoft           and the biography of artefacts. Computer Supported
   research Redmond, WA.                                         Cooperative Work (CSCW), 19(6):521–556.

[ILC-CNR2019] ILC-CNR. 2019. The ItaliaNLP                   [Schmid1995] Helmut Schmid. 1995. Treetagger - a
   REST Service. http://api.italianlp.it/                        language independent part-of-speech tagger. Insti-
   docs/.                                                        tut für Maschinelle Sprachverarbeitung, Universität
                                                                 Stuttgart, 43:28.
[Kottmann et al.2011] J Kottmann, B Margulies, G In-
   gersoll, I Drost, J Kosin, J Baldridge, T Goetz,          [Schut and Whiteside2007] Peter Schut and A White-
   T Morton, W Silva, A Autayeu, et al. 2011. Apache             side.   2007.    OpenGIS Web Processing Ser-
   OpenNLP. www.opennlp.apache.org.                              vice.   OGC project document http://www.
                                                                 opengeospatial.org/standards/wps.
[Lebo et al.2013] Timothy Lebo, Satya Sahoo, Debo-
   rah McGuinness, Khalid Belhajjame, James Cheney,          [SoBigData European Project2016] SoBigData    Eu-
   David Corsar, Daniel Garijo, Stian Soiland-Reyes,            ropean Project.    2016.    Deliverable D2.7 -
   Stephan Zednik, and Jun Zhao. 2013. Prov-o: The              IP principles and business models.      http:
   prov ontology. W3C Recommendation, 30.                       //project.sobigdata.eu/material.

[Linthicum2017] David S Linthicum. 2017. Cloud               [SoBigData2019] SoBigData. 2019. The SoBigData
    computing changes data integration forever: What’s          European Project. http://sobigdata.eu/
    needed right now. IEEE Cloud Computing, 4(3):50–            index.
    53.
                                                             [SpeechTEK 20102019] SpeechTEK 2010.  2019.
[Magnini et al.2006] Bernardo Magnini, Emanuele Pi-             SpeechTEK 2010 - H-Care Avatar wins Peo-
   anta, Christian Girardi, Matteo Negri, Lorenza Ro-           ple’s Choice Award. http://web.archive.
   mano, Manuela Speranza, Valentina Bartalesi, and             org/web/20160919100019/http://www.
   Rachele Sprugnoli. 2006. I-cab: the italian content          speechtek.com/europe2010/avatar/.
   annotation bank. In LREC, pages 963–968. Citeseer.
                                                             [Stanford University2019] Stanford University. 2019.
[Manning et al.2014] Christopher Manning, Mihai Sur-             Stanford CoreNLP - Human Languages Sup-
   deanu, John Bauer, Jenny Finkel, Steven Bethard,              ported.   https://stanfordnlp.github.
   and David McClosky. 2014. The stanford corenlp                io/CoreNLP/.
   natural language processing toolkit. In Proceedings
   of 52nd annual meeting of the association for com-        [Tablan et al.2011] Valentin Tablan, Ian Roberts,
   putational linguistics: system demonstrations, pages          Hamish Cunningham, and Kalina Bontcheva.
   55–60.                                                        2011. GATE Cloud.net: Cloud Infrastructure for
                                                                 Large-Scale, Open-Source Text Processing. In UK
[Metilli et al.2019] Daniele Metilli, Valentina Bartalesi,       e-Science All hands Meeting.
   and Carlo Meghini. 2019. Steps towards a system
   to extract. In Proceedings of the Text2Story 2019         [Vossen et al.2016] Piek Vossen, Rodrigo Agerri, Itziar
   Workshop, page na. Springer.                                 Aldabe, Agata Cybulska, Marieke van Erp, Antske
                                                                Fokkens, Egoitz Laparra, Anne-Lyse Minard,
[OpenAire2019] OpenAire. 2019. European project                 Alessio Palmero Aprosio, German Rigau, et al.
   supporting Open Access.     https://www.                     2016. Newsreader: Using knowledge resources
   openaire.eu/.                                                in a cross-lingual reading machine to generate
                                                                more knowledge from massive streams of news.
[OpenMinTeD2019] OpenMinTeD.    2019. Open                      Knowledge-Based Systems, 110:60–85.
   Mining INfrastructure for TExt and Data.
   https://cordis.europa.eu/project/                         [Wei et al.2016] Chih-Hsuan Wei, Robert Leaman, and
   rcn/194923/factsheet/en.                                     Zhiyong Lu. 2016. Beyond accuracy: creating in-
                                                                teroperable and scalable text-mining web services.
[Parthenos2019] Parthenos.    2019.                 The         Bioinformatics, 32(12):1907–1910.
    Parthenos European Project.                   http:
    //www.parthenos-project.eu/.

[Petasis et al.2011] Georgios     Petasis,     Vangelis
    Karkaletsis, Georgios Paliouras, Anastasia Krithara,
    and Elias Zavitsanos.      2011.    Ontology pop-
    ulation and enrichment: State of the art.         In