<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>MultiScien: a Bi-Lingual Natural Language Processing System for Mining and Enrichment of Scientific Collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Horacio Saggion</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Ronzano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Accuosto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Daniel Ferrés</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Large-Scale Text Understanding Systems Lab TALN Research Group Department of Information and Communication Technologies Universitat Pompeu Fabra C/ Tànger 12</institution>
          ,
          <addr-line>08018 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the current online Open Science context, scientific datasets and tools for deep text analysis, visualization and exploitation play a major role. We present a system for deep analysis and annotation of scientific text collections. We also introduce the first version of the SEPLN Anthology, a bi-lingual (Spanish and English) fully annotated text resource in the field of natural language processing that we created with our system. Moreover, a faceted-search and visualization system to explore the created resource is introduced. All resources created for this paper will be available to the research community.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Resources</kwd>
        <kwd>Scientific Text Corpora</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Data Visualization</kwd>
        <kwd>Semantic Analysis</kwd>
        <kwd>PDF Conversion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Scientific articles are among the most valuable textual records of human
knowledge. In the past few years the volume of scientific publications made available
online has grown exponentially, opening interesting research avenues for Natural
Language Processing (NLP), Information Retrieval, and Data Visualization.</p>
      <p>
        In the current online Open Science context [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the availability of scientific
datasets and tools for deep text analysis, visualization and exploitation plays a
major role. Access to recent and past scientific discoveries, methods, and
techniques is essential for scientific inquiry activities which include, among others:
(i) finding open or solved problems, (ii) understanding research fields' dynamics
or evolution, (iii) discovering experts in specific scientific areas, or (iv)
understanding the advantages and limitations of current solutions.
      </p>
      <p>
        In recent years, a number of initiatives have emerged to make scientific
articles accessible as structured text corpora: notable examples include the ACL
Anthology network [23], the Darmstadt Scientific Text Corpus [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or the TALN
Archives [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. At the same time, efforts to make publications available through
scientific portals have proliferated with CiteSeer [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Google Scholar [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
Microsoft Academic [29], or DBLP [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as some of the best known. However,
to the best of our knowledge, most of these initiatives do not expose rich
linguistic annotations and only perform limited analyses of the content of the
scientific document, mainly dealing with the extraction of the document's
logical structure and layout and with the indexing of contents. Papers available as PDF
documents are indeed often processed to extract relevant meta-data such as
authors, titles and abstracts, addressing in some cases author disambiguation and
the extraction of citations and citation sentences for citation network building.
      </p>
      <p>In order to exploit the rich content of scientific collections, we have developed
a system for deep analysis of research papers. The system integrates text parsing,
coreference resolution, bibliography enrichment, word sense disambiguation and
entity linking, rhetorical sentence classification, citation purpose identification,
open information extraction, and text summarization. Moreover, the tool is able
to deal with texts in English and Spanish. For this research, we have applied
our system to create the initial version of the Anthology of the Spanish Society
for Natural Language Processing (http://www.sepln.org/), the SEPLN Anthology:
a bi-lingual (Spanish / English) text resource in the field of Natural Language
Processing built by extracting structured textual contents and performing
linguistic and semantic analyses of the research articles published by SEPLN over
the years. The annotated textual contents of the SEPLN Anthology are freely
available to the NLP community in order to boost research and experimentation
in the field of scientific text mining.</p>
      <p>In order to support intelligent access to the collection, the documents,
metadata and linguistic information of this resource have been properly indexed and
made available in a web platform to offer the possibility of interactively exploring
the knowledge from the SEPLN universe through faceted searches.</p>
      <p>We summarize the contributions of this paper as follows:
- The first bi-lingual scientific text analysis system which is specially adapted
to SEPLN and other types of scientific publications;
- The first version of the SEPLN Anthology, including SEPLN articles from
2008 to 2016 and annotated with meta-data and rich linguistic information;
- A faceted-search and visualization interface to explore the SEPLN universe.</p>
      <p>The rest of the paper is organized as follows: in Section 2 we describe the
scientific text mining tools and resources we have implemented and exploited
to extract and process the contents of SEPLN articles so as to build the first
version of the SEPLN Anthology corpus. In Section 3 we present the web portal
we developed to perform innovative faceted browsing of the SEPLN papers.
After briefly reporting related work (Section 4), the paper is concluded with
Section 5, where we present our future avenues of research by outlining how to
improve the SEPLN Anthology as well as by anticipating the possibilities that
this resource offers in terms of boosting the involvement of the NLP community
in scientific text mining by means, for instance, of possible content analysis and
visualization challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>Building the SEPLN Anthology</title>
        <p>Focusing on the promotion of research in the field of NLP for the past 33 years,
the SEPLN has systematically published research articles in both its annual
conference proceedings and the SEPLN journal. By allowing the articles to be
written in Spanish (as well as in English), SEPLN has also played an important
role in fostering NLP research and linguistic technologies targeted to Spanish
and in making them accessible to the Spanish-speaking scientific community.</p>
        <p>The fact that, over the past few years, each SEPLN article has contained titles
and abstracts in both Spanish and English is a relevant characteristic, not only
to boost dissemination, but also to enable new experimentation in different
areas of NLP. Both the presence of bi-lingual contents and the peculiar style of
SEPLN articles pose specific content extraction and analysis challenges that we
address in this work.</p>
        <p>In a nutshell, the building of the SEPLN Anthology started by crawling the
DBLP Computer Science repository in order to collect the meta-data associated
with each research paper (e.g. the BibTeX entries) and to retrieve the link to the
official open access PDF publication for further processing. In the following
sections we detail the set of steps we have applied to extract, analyze and enrich
the contents of each article in order to support intelligent access to the SEPLN
archive. In particular, after a brief overview of our initial dataset in Section 2.1,
we describe how we converted SEPLN PDF articles into structured XML
documents (Section 2.2). Then we explain how we perform linguistic and semantic
analyses of the contents of each article (Section 2.3) and how we enrich these
contents by means of metadata retrieved from online web services and knowledge
resources (Section 2.4). The textual content of the SEPLN articles of the
Anthology, together with the results of their linguistic annotation and enrichment,
can be downloaded at http://backingdata.org/seplndata/ (annotated papers are
saved and made available in the GATE XML data format,
https://gate.ac.uk/sale/tao/splitch5.html#x8-960005.5.2).</p>
        <sec id="sec-2-1-1">
          <title>The SEPLN Journal Dataset</title>
          <p>The dataset analyzed in this work consists of the research articles published
in the SEPLN journal (http://journal.sepln.org) between 2008 and 2016.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Converting PDF to XML</title>
          <p>
            Even if the adoption of XML-based formats in scientific publishing is
growing considerably, more than 80% of the scientific literature is available as PDF
documents. As a consequence, the possibility to consistently extract structured
textual content from PDF files constitutes an essential step to bootstrap any
scientific text mining process. According to recent evaluations [
            <xref ref-type="bibr" rid="ref16">16, 30</xref>
            ] the best
performing tools to extract structured contents from PDF articles include:
- GROBID (http://github.com/kermitt2/grobid): a Java library that exploits a chain of conditional random field
(CRF) sequence taggers to identify the correct semantic class of the textual
contents of scientific articles;
- CERMINE (http://cermine.ceon.pl/): a Java library that relies on both unsupervised and supervised
algorithms to spot a rich set of structural features of scientific publications
in PDF format;
- PDFX (http://pdfx.cs.man.ac.uk/): an online web service that applies a set of layout and content based
rules to extract structural textual elements from PDF papers (including title,
abstract, bibliographic entries, etc.).
          </p>
          <p>In general, existing tools extract structured contents from PDF articles by
generating XML files, sometimes complemented by HTML visualizations of the
same content with a different layout. Therefore, when a PDF publication is
processed, its original textual layout is completely lost, and when the information
extracted from a paper is displayed as an HTML page (e.g. to highlight the
occurrences of specific words or spot the excerpts presenting the challenges faced
by the authors) users are disoriented by the loss of any reference to the layout
of the original document.</p>
          <p>We addressed this issue by implementing PDFdigest, an innovative
application that extracts structured textual contents from scientific articles in PDF
format. The output of the PDFdigest conversion is a pair of XML and HTML
documents. While the XML document is useful to save the structural elements
identified in the paper (title, authors, abstract, etc.), the HTML document
includes the contents of the paper preserving its original layout. Moreover, each
element identified by the markup of the XML document is mapped to the list
of identifiers of the DIV elements holding the corresponding textual content in
the HTML document. As a result, interactive, layout-preserving HTML-based
visualizations of a single paper can be generated, as illustrated in Section 3.</p>
          <p>PDFdigest is a Java-based application tailored to deal with both one-column
and two-column layouts of PDF articles with textual contents expressed in one
or several languages (as SEPLN articles combine Spanish and English).</p>
          <p>PDFdigest is able to spot the following core set of structural elements of
scientific articles: title, second title (in case it exists), authors' names, affiliations
and emails, abstract(s), categories, keyword(s), section titles and textual content,
acknowledgements and bibliographic entries. PDFdigest can also extract other
fields, including annexes, authors' biographies, figure and table captions, abstract
title and keywords title, among others.</p>
          <p>The content extraction pipeline implemented by PDFdigest consists of the
following five phases: (i) PDF to HTML conversion, (ii) computation of
statistics of HTML tags and CSS elements, (iii) rule-based content detection and
extraction, (iv) language prediction and (v) XML generation.</p>
          <p>In the first phase (PDF to HTML conversion) we use pdf2htmlEX
(http://github.com/coolwanglu/pdf2htmlEX), obtaining
an HTML document that includes DIV elements defining the position and style
of small portions of the paper, preserving the paper's original layout by means
of CSS properties.</p>
          <p>The following phase (computation of structural element statistics) relies on
the JSoup (http://jsoup.org/) HTML parsing library, which exploits the HTML tags and CSS
properties to extract textual contents and identify their semantics (title, abstract,
keywords, etc.). Based on this information, PDFdigest computes statistics of
the HTML tags and CSS properties used in the document (i.e. the most
frequently used font types and sizes) and, then, a rule-based extraction phase
iterates over the HTML tags and applies a complex set of manually-generated
rules to detect and retrieve the structural elements of the paper (i.e. title,
abstract, acknowledgements). Specific extraction rules and analysis procedures are
implemented for each type of structural element based on the computed layout
statistics, the structural markers previously detected, and a set of
language-dependent and content-specific regular expressions that can be manually
modified or extended.</p>
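          <p>A minimal JSoup sketch of the statistics phase is shown below: it counts how
often each CSS class occurs on the DIV elements produced by pdf2htmlEX, so that
rules can later flag text whose style deviates from the dominant (body-text) one.
The HTML snippet and class names are illustrative.</p>
          <preformat><![CDATA[
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch (not the actual PDFdigest code): count CSS class frequencies over
// the DIVs of a pdf2htmlEX-style document with JSoup.
public final class LayoutStatsSketch {
    public static void main(String[] args) {
        String html = "<html><body>"
            + "<div class='ff1 fs3'>body text ...</div>"
            + "<div class='ff1 fs3'>more body text ...</div>"
            + "<div class='ff2 fs5'>A Section Title</div>"
            + "</body></html>";
        Document doc = Jsoup.parse(html);
        Map<String, Integer> classCounts = new HashMap<>();
        for (Element div : doc.select("div")) {
            for (String cssClass : div.classNames()) {
                classCounts.merge(cssClass, 1, Integer::sum);
            }
        }
        // A rule could now treat DIVs carrying a non-modal font class
        // (here 'ff2 fs5') as candidate titles or headers.
        System.out.println(classCounts); // e.g. {ff1=2, fs3=2, ff2=1, fs5=1}
    }
}
]]></preformat>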
          <p>The final phase performs language prediction of the textual contents of each
structural element (to identify, for instance, English and Spanish versions of the
title, abstract and keywords) and generates the output file in XML format. A
set of language probabilities is calculated individually for each structural
element and globally for the whole textual content extracted from each section of
the article. The language prediction is computed using the optimaize
language-detector Java API (http://github.com/optimaize/language-detector).</p>
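          <p>The sketch below shows the library's documented usage pattern for detecting
the language of a text excerpt; the example string is ours, and the surrounding code
is a simplification of how the detector could be invoked per structural element.</p>
          <preformat><![CDATA[
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObjectFactory;

// Language detection with the optimaize language-detector, following the
// library's standard usage: build a detector over the built-in profiles,
// then query it for each extracted structural element.
public final class LanguageDetectionSketch {
    public static void main(String[] args) throws java.io.IOException {
        LanguageDetector detector = LanguageDetectorBuilder
            .create(NgramExtractors.standard())
            .withProfiles(new LanguageProfileReader().readAllBuiltIn())
            .build();
        TextObjectFactory textFactory =
            CommonTextObjectFactories.forDetectingOnLargeText();
        Optional<LdLocale> lang = detector.detect(
            textFactory.forText("Resumen: presentamos un sistema de analisis de textos."));
        System.out.println(lang.isPresent() ? lang.get().getLanguage() : "und"); // "es"
    }
}
]]></preformat>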
          <p>In a post-processing step, the generated XML is validated by means of the
JTidy (http://jtidy.sourceforge.net/) and SAX (http://www.saxproject.org/) Java parsers.</p>
          <p>PDFdigest is highly customizable and can be easily adapted to different styles
of scientific articles and languages by modifying the language-dependent regular
expressions, the offset thresholds, and (in some special cases) the finite states
that define the logical structure of the paper and the extraction and consumption
rules themselves.</p>
          <p>We evaluated the structured textual content extraction quality of PDFdigest
by creating a gold standard set of 30 SEPLN articles manually annotated with
respect to eight types of structural elements: title(s), abstract(s), list of keywords,
section headers (up to a depth of three levels), paragraphs, table captions, figure
captions and bibliographic entries. Among all the annotations of structural
elements generated by PDFdigest, in our evaluation we considered as true positive
items only those annotations spotting the same structural element and covering
a text span identical to a gold standard annotation. Over all the structural
elements considered, PDFdigest obtained a weighted average F1 score equal
to 0.917. The most common content extraction errors, besides table and figure
captions, are due to skipped section titles and bibliographic entries.</p>
        </sec>
        <sec id="sec-3-1">
          <title>Linguistic Analysis</title>
          <p>We process the structured textual content extracted from SEPLN Journal Papers
in PDF format (see Section 2.2) by means of a customized version of the Dr.
Inventor Text Mining Framework [24, 25] (http://driframework.readthedocs.io/).
We distribute the results of these
analyses as textual annotations of the corpus of analyzed SEPLN articles.</p>
          <p>We identify sentences in the abstracts and the paragraphs of each paper
by means of a sentence splitter customized to scientific publications, which
properly deals with expressions like i.e., et al., Fig., Tab., etc.</p>
        <p>Then we spot tokens inside the textual content extracted from each paper
by means of a rule-based language-independent tokenizer developed by relying
on ANNIE, the Information Extraction toolbox integrated in GATE [19].</p>
          <p>
            Thanks to a set of JAPE rules (http://gate.ac.uk/sale/tao/splitch8.html) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] customized to the specific citation formats
of SEPLN papers, we identify inline citations in the sentences of each article;
then we apply a set of heuristics to link each inline citation to the referenced
bibliographic entry. By relying on lexical match rules, we mark inline citations
that have a syntactic role in the sentence (e.g. in 'Rossi et al. (2015) discovered
that...', the inline citation 'Rossi et al. (2015)' is the subject of the sentence).
          </p>
          <p>
            We exploit the MATE tools (http://code.google.com/p/mate-tools/) [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] to perform both POS-tagging and
dependency parsing of publications. Since SEPLN Journal Papers present mixed
English and Spanish textual contents, for each text excerpt to analyze (sentence,
title, etc.) we rely on the language identified by the PDF-to-text converter to
properly select the language-specific POS-tagger and dependency parser to use.
When we apply dependency parsing, we consider as a single token only inline
citations that have a syntactic role in the sentence where they occur, ignoring
the other ones.
          </p>
        <p>In order to make explicit the rhetorical organization of papers, we classify
each sentence as belonging to one of the following categories: Approach,
Challenge, Background, Outcomes and Future Work.</p>
        <p>
          To this purpose, by exploiting the Weka machine learning platform15 [33],
we trained a logistic regression classi er over the Dr. Inventor Multi-Layer
Scienti c Corpus [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a colection of 40 English articles in which each sentence has
12 http://driframework.readthedocs.io/
13 http://gate.ac.uk/sale/tao/splitch8.html
14 http://code.google.com/p/mate-tools/
15 http://www.cs.waikato.ac.nz/ml/weka/
been manually assigned a speci c rhetorical category. To enable automated
classi cation, we model each sentence by means of a set of linguistic and semantic
features, most of them extracted by relying on the results of the textual analyses
described in this Section. We identi ed the rhetorical category of all the English
excerpts of SEPLN Journal Papers,16 leaving as future work the extension of
the classi er to Spanish texts.
        </p>
          <p>By applying a set of 84 JAPE rules formulated and iteratively refined by
analyzing a collection of 40 English scientific articles, we spot causal relations
inside the English content of SEPLN papers. Each causal relation is composed
of two text excerpts, respectively spotting the cause and the related effect.</p>
          <p>
            We perform coreference resolution over the English contents of SEPLN papers
by applying a deterministic approach similar to the one proposed by the Stanford
Coreference Resolution System (http://nlp.stanford.edu/software/dcoref.shtml) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. In particular, when building coreference
chains we match nominal and pronominal coreferent candidates by considering
exact matches, pronominal matches, appositions and predicate nominatives.
          </p>
          <p>
            We support the generation of extractive summaries of SEPLN articles by
ranking the sentences of each paper with respect to their relevance for
inclusion in a summary of the same publication. To this purpose we associate
with each sentence a summary relevance score: the higher the value of this score,
the more suitable the sentence is to be part of a summary of the paper. In
particular, we determine two types of summary relevance scores for each
sentence, relying respectively on the TF-IDF similarity of the sentence with the
title of the paper [27] and on LexRank, a graph-based summarization
algorithm [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. In LexRank we computed the TF-IDF similarity among pairs of
sentences by considering the IDF scores of tokens derived from the English and
Spanish Wikipedias (2015 dumps). Other methods, including centroid and term
frequency, are implemented based on an available summarization library [27].
          </p>
      </sec>
      <sec id="sec-3-2">
        <title>Enriching Papers</title>
        <p>Besides the linguistic analyses described in the previous section (2.3), we exploit
a set of online resources and web services in order to further enrich the contents
of the papers, thus enabling richer navigation and visualization possibilities.
Mining bibliographic metadata. We retrieve from DBLP the bibliographic
metadata of each SEPLN paper, thus collecting the related BibTeX records. To
get structured metadata describing the bibliographic entries extracted from each
SEPLN article, we rely on three web services:
- Bibsonomy (http://bitbucket.org/bibsonomy/bibsonomy/wiki/browse/documentation/api/):
thanks to the web API of Bibsonomy we are able to retrieve
the BibTeX metadata of a bibliographic entry if present in the Bibsonomy
database;
- CrossRef (http://search.crossref.org/help/api/): thanks to the bibliographic link web API of CrossRef we can
match a wide variety of free-form citations to DOIs if present in CrossRef;
- FreeCite (http://freecite.library.brown.edu/): this online tool analyzes the text of references by relying on
a conditional random field sequence tagger trained on the CORA dataset,
made of 1,838 manually tagged bibliographic entries
(http://hpi.de/naumann/projects/repeatability/datasets/cora-dataset.html).</p>
        <p>By giving precedence to metadata retrieved from Bibsonomy over CrossRef
and FreeCite outputs (because of its higher accuracy), it is possible to merge
the results retrieved by querying these three REST endpoints, trying to
determine for each bibliographic entry the title of the paper, the year of publication,
the list of authors, and the venue or journal of publication.</p>
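        <p>The merge policy can be pictured as a simple per-field precedence rule, as in
the sketch below; the record type and the field set mirror the ones listed above,
while the example values are invented.</p>
        <preformat><![CDATA[
// Sketch of the per-field merge of the three services' outputs:
// Bibsonomy wins over CrossRef, which wins over FreeCite.
public final class BibMergeSketch {
    record BibEntry(String title, String year, String authors, String venue) {}

    static String pick(String bibsonomy, String crossref, String freecite) {
        if (bibsonomy != null) return bibsonomy;
        if (crossref != null) return crossref;
        return freecite; // may still be null if no service found the field
    }

    static BibEntry merge(BibEntry b, BibEntry c, BibEntry f) {
        return new BibEntry(
            pick(b.title(), c.title(), f.title()),
            pick(b.year(), c.year(), f.year()),
            pick(b.authors(), c.authors(), f.authors()),
            pick(b.venue(), c.venue(), f.venue()));
    }

    public static void main(String[] args) {
        BibEntry bibsonomy = new BibEntry("DBLP: Some lessons learned", null, null, null);
        BibEntry crossref  = new BibEntry(null, "2009", "Ley, M.", null);
        BibEntry freecite  = new BibEntry(null, null, null, "VLDB Endowment");
        System.out.println(merge(bibsonomy, crossref, freecite));
    }
}
]]></preformat>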
        <p>To better characterize the semantics of SEPLN articles, we disambiguate
their contents by means of the Babelfy web service (http://babelfy.org/) [21]. Babelfy spots the
occurrences of concepts and Named Entities inside the text of each paper and
links them to their right meaning chosen in the sense inventory of BabelNet.
Author affiliation identification. Our PDF-to-XML conversion approach
(introduced in Section 2.2) manages to extract from the header of papers the names
and emails of the authors together with the text describing their affiliations. We
parse the text of each affiliation with both the Google Geocoding and the
DBpedia Spotlight web services in order to try to unambiguously determine the
mentioned organization (university, institute, company) together with the city
and state where it is located:</p>
        <p>- Google Geocoding (http://developers.google.com/maps/documentation/geocoding/intro/):
useful to identify the name of the institution and its normalized address;
- DBpedia Spotlight (http://demo.dbpedia-spotlight.org/): exploited to identify instances of the following types
from the DBpedia ontology: Organization, Country and City.</p>
        <p>The metadata added to describe each affiliation can be properly merged to
try to determine the name and address of the organization referred to, as well as its
geolocation, thus enabling the possibility to explore the geographic dimension of
the corpus of SEPLN papers so as to visualize, for instance, the yearly changes
of the geographic distribution of the authors who published SEPLN articles.</p>
        <p>In addition, we exploit the association of the authors to their email addresses
to identify their affiliations when this cannot be done solely based on the
information retrieved from the Google and DBpedia services due, for instance, to the
presence of multiple institutions in the affiliation text, or when there is ambiguity
in the identification of the entity that corresponds to the author affiliation.</p>
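        <p>This disambiguation step can be sketched as a simple fallback: if DBpedia
Spotlight yields exactly one Organisation entity the mapping is unambiguous;
otherwise a lookup keyed by the author's email domain is attempted. All names,
domains and addresses below are invented for illustration.</p>
        <preformat><![CDATA[
import java.util.List;
import java.util.Map;

// Sketch of affiliation disambiguation with an email-domain fallback.
public final class AffiliationSketch {
    // Invented lookup table from email domains to institutions.
    static final Map<String, String> DOMAIN_TO_ORG =
        Map.of("upf.edu", "Universitat Pompeu Fabra");

    static String resolveOrganization(List<String> spotlightOrgs, String email) {
        if (spotlightOrgs.size() == 1) return spotlightOrgs.get(0); // unambiguous
        String domain = email.substring(email.indexOf('@') + 1);
        return DOMAIN_TO_ORG.getOrDefault(domain, null); // null: leave unresolved
    }

    public static void main(String[] args) {
        // Two institutions mentioned in the affiliation text: ambiguous,
        // so the (invented) author email decides.
        List<String> orgs =
            List.of("Universitat Pompeu Fabra", "Universitat de Barcelona");
        System.out.println(resolveOrganization(orgs, "author@upf.edu"));
    }
}
]]></preformat>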
        <p>We evaluated the quality of the automated assignment of institutions to
authors by manually checking a randomly-selected subset including 10% of the
1,585 author-paper instances available in the corpus, obtaining a precision value
of 0.83 for the unique identification of the affiliations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Visualizing SEPLN Articles</title>
      <p>
        The results of the analysis and enrichment of SEPLN papers are made openly available
on a visualization platform accessible at http://backingdata.org/sepln/. For
a use case of the dataset and platform the reader is referred to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <sec id="sec-3-3">
        <title>The Web Visualization Platform</title>
        <p>User-friendliness was one of the main goals in the design of the visualization
platform. When first accessing it, the user finds a sortable table with basic
metadata of the papers of the SEPLN Anthology: title, authors and publication
year. On a sidebar, a full-text search box can be used to retrieve documents
based on their title, keywords, abstract sentences, author names or affiliations.</p>
        <p>Filter fields are populated dynamically with values retrieved from the
indexed content, making faceted searches available for an incremental exploration
of the corpus. When selected, filters are applied to the search engine calls and
immediately reflected in the visualizations offered by the platform. Currently,
filters are available for keywords, authors, affiliations, topics, BabelNet synsets,
countries, cities and publication years.</p>
        <p>The HTML version of the paper produced by PDFdigest is made available with
an added layer of interactivity that allows the user to explore the rhetorical
categories assigned to the paper's sentences, as well as the sentences identified
by the summary ranking algorithms (Fig. 1).</p>
        <p>As storage and search solution for the analyzed papers we use Elasticsearch
(http://www.elastic.co/), an open-source application developed in Java and based
on the Apache Lucene (http://lucene.apache.org/) search engine library. The
availability of Elasticsearch clients for the most popular programming languages
makes it suitable for our needs, as it provides great flexibility in terms of both
indexing and accessing the data. In our project we use version 5.2 of Elasticsearch
and its corresponding Java API
(http://elastic.co/guide/en/elasticsearch/client/java-api/).</p>
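        <p>As an illustration of how a faceted search could translate into an Elasticsearch
query with the 5.x Java API, the sketch below combines the free-text box (a
multi_match over the indexed fields) with two selected filters as term clauses. The
field names are illustrative and need not match the platform's actual index mapping.</p>
        <preformat><![CDATA[
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Sketch: build a bool query that mirrors a faceted search (full text +
// selected filters) with the Elasticsearch 5.x Java API.
public final class FacetedQuerySketch {
    public static void main(String[] args) {
        BoolQueryBuilder query = QueryBuilders.boolQuery()
            .must(QueryBuilders.multiMatchQuery("word sense disambiguation",
                    "title", "abstract", "keywords", "authors", "affiliations"))
            .filter(QueryBuilders.termQuery("year", 2015))
            .filter(QueryBuilders.termQuery("country", "Spain"));
        System.out.println(query); // prints the query body as JSON
    }
}
]]></preformat>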
        <p>Elasticsearch is distributed together with Kibana
(http://www.elastic.co/products/kibana/), a web-based platform for
the analysis and visualization of indexed data. Our web visualization platform
takes advantage of Kibana's possibilities for the generation of visualizations but
provides a layer on top of it in order to simplify the process of data exploration
by means of custom filter and search functionalities.</p>
        <p>The following four visualizations currently integrated into our platform
showcase this possibility: (i) date histogram of published papers; (ii) heatmap of
keywords used throughout the years; (iii) word cloud of keywords; (iv) heatmap
with the evolution of concepts through rhetorical categories over the years.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Geo-spatial Visualizations</title>
        <p>Geo-spatial visualization of data makes it possible to better identify relationships
that can be difficult to grasp from tables or graphics alone. We
provide geo-visualizations based on choropleth maps of the world and Spain in
which countries (provinces, in the case of Spain) are shaded in proportion to
the number of publications available from those countries or provinces, as shown
in Fig. 2. The different geographic regions are linked by arcs that graphically
represent the existence of collaborations among authors affiliated to institutions
in the corresponding geographical spaces.</p>
        <p>The data used to generate these visualizations is dynamically retrieved from
the Elasticsearch index by means of its JavaScript client
(http://elastic.co/guide/en/elasticsearch/client/javascript-api/).</p>
        <p>
          Scalable vector graphics representations of the maps are enriched with data
and made responsive to events triggered by the user by means of the D3.js
library (http://d3js.org/).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        One of the most comprehensive NLP-related corpora is the ACL Anthology [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
which includes all papers published under the ACL umbrella. The resource was
transformed into the ACL Anthology Network (AAN) [23] to enrich the
anthology with citation network information. AAN has been used in different NLP
tasks, notably summarization and citation categorization. It was also featured
in several exploratory use cases, including the Action Science Explorer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to
investigate the evolution of the field of dependency parsing and in [18] to study,
among other elements, author collaborations, citations, and funding agencies in
the LREC universe. Concerning tools for scientific text exploration and access, in
addition to generic scientific document retrieval systems such as Google Scholar
or Microsoft Academic, we can mention the ACL Anthology Searchbench
system [28], which performs linguistic analysis of the documents and allows users
to explore the ACL Anthology by means of predicate-argument queries and
other filter types, and Saffron [20] and Rexplore [22], which are open-domain but
perform a reduced set of NLP analyses over publications.
        </p>
      <p>During the last few years, tasks related to the extraction, summarization and
modeling of information from scientific publications have been proposed in the
context of several text mining challenges. Relevant examples are the TAC 2014
Biomedical Summarization Track (http://tac.nist.gov/2014/BiomedSumm/) and
the series of Computational Linguistic Scientific Summarization Tasks
(http://wing.comp.nus.edu.sg/cl-scisumm2017/), where the participating teams
have been required to generate summaries of scientific articles by taking advantage
of the citations they receive. Recently, the participants of the Semantic Publishing
Challenges [31] have been required to extract structured information from
papers and model this knowledge as RDF datasets in order to enable the evaluation
of the quality of scientific events and publications by means of SPARQL queries.
By relying on the data of the Microsoft Academic Graph
(http://microsoft.com/en-us/research/project/microsoft-academic-graph/), several
shared tasks have also been proposed dealing with author disambiguation and
relevance ranking of papers and institutions [26, 32].</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>With the availability of massive amounts of scientific publications, research in
scientific text mining has proliferated in recent years. However, most approaches
perform shallow NLP analysis over scientific contents and consider mainly
English publications. In this paper we have described the initial release of the
SEPLN Anthology, an automatically analyzed textual corpus created by mining
SEPLN publications. We have also introduced a configurable toolkit developed
to transform PDF documents into pairs of XML and HTML files that we
further analyze with general and language-specific customizable NLP techniques
so as to create rich semantic representations of SEPLN documents. We
analyze each paper by performing: (i) dependency parsing of English and Spanish
sentences, (ii) citation identification, linking, and enrichment, (iii) author
information identification, (iv) concept disambiguation, (v) information extraction,
and (vi) document summarization. Furthermore, we have also developed a
web-based information access platform which exploits the SEPLN Anthology
documents to provide interesting single-document and document collection-based
visualizations as a means to explore the rich generated contents.</p>
      <p>The resources developed in this work are being made available to the research
community. We are committed to keep evolving the SEPLN Anthology (e.g.,
releasing periodic versions, adding functionalities) so as to make it useful in
both research and educational activities. There are several avenues of future
research we would like to pursue: one of the most relevant is the creation of
a gold standard annotation dataset (a subset of representative documents) with
curated information on authors, rhetorical categories, citations, etc. In parallel,
we would like to improve the precision of PDFdigest as well as to perform
user-centric evaluations of our web-based interface in order to better understand the
value and possibilities of the rich scientific corpora search and browsing patterns
we propose.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>Acknowledgements</title>
      <p>This work is (partly) supported by the Spanish Ministry of Economy and
Competitiveness under the María de Maeztu Units of Excellence Programme
(MDM-2015-0502) and by the TUNER project (TIN2015-65308-C5-5-R, MINECO /
FEDER, UE).</p>
        </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Accuosto</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ronzano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferres</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saggion</surname>
          </string-name>
          , H.:
          <article-title>Multi-level mining and visualization of scientific text collections</article-title>
          .
          <article-title>Exploring a bi-lingual scientific repository</article-title>
          .
          <source>In: Proceedings of WOSP 2017 - ACM/IEEE-CS Joint Conference on Digital Libraries. ACM</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dale</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorr</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Powley</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>Y.F.</given-names>
          </string-name>
          :
          <article-title>The ACL anthology reference corpus: A reference dataset for bibliographic research in computational linguistics</article-title>
          .
          <source>In: Proceedings of LREC</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bohnet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Very high accuracy and fast dependency parsing is not a contradiction</article-title>
          .
          <source>In: Proceedings of the 23rd COLING</source>
          . pp.
          <fpage>89</fpage>
          -
          <lpage>97</lpage>
          . Association for Computational Linguistics (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Boudin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>TALN archives : une archive numérique francophone des articles de recherche en traitement automatique de la langue</article-title>
          .
          <source>In: TALN 2013</source>
          . pp.
          <fpage>507</fpage>
          -
          <lpage>514</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettifer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voronkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>PDFX: fully-automated PDF-to-XML conversion of scientific literature</article-title>
          .
          <source>In: Proceedings of the 2013 ACM symposium on Document engineering</source>
          . pp.
          <fpage>177</fpage>
          -
          <lpage>180</lpage>
          . ACM
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
          </string-name>
          , V.:
          <article-title>JAPE: a Java Annotation Patterns Engine (Second Edition)</article-title>
          . Research Memorandum CS-00-10, Department of Computer Science, University of Sheffield (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Directorate-General for Research and Innovation (European Commission):
          <article-title>Open innovation, open science, open to the world: A vision for Europe</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dunne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shneiderman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gove</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klavans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorr</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .
          <volume>63</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2351</fpage>
          -
          <lpage>2369</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Erkan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          :
          <article-title>LexRank: Graph-based lexical centrality as salience in text summarization</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>22</volume>
          ,
          <fpage>457</fpage>
          -
          <lpage>479</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fisas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ronzano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saggion</surname>
          </string-name>
          , H.:
          <article-title>A multi-layered annotated corpus of scientific papers</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC</source>
          <year>2016</year>
          ).
          <source>European Language Resources Association (ELRA)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Holtz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teich</surname>
          </string-name>
          , E.:
          <article-title>Design of the Darmstadt Scientific Text Corpus (DaSciTex)</article-title>
          .
          <source>Technical Report DFG project TE 198/1-1</source>
          , Technische Universität Darmstadt (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Jacso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Google Scholar: the pros and the cons</article-title>
          .
          <source>Online Information Review</source>
          <volume>29</volume>
          (
          <issue>2</issue>
          ),
          <fpage>208</fpage>
          -
          <lpage>214</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peirsman</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chambers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task</article-title>
          .
          <source>In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task</source>
          (pp.
          <fpage>28</fpage>
          -
          <lpage>34</lpage>
          ). Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ley</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>DBLP: Some lessons learned</article-title>
          .
          <source>VLDB Endowment</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1493</fpage>
          -
          <lpage>1500</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Councill</surname>
            ,
            <given-names>I.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>CiteSeerX: an architecture and web service design for an academic document search engine</article-title>
          .
          <source>In: Proceedings of the 15th WWW Conference</source>
          . pp.
          <fpage>883</fpage>
          -
          <lpage>884</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lipinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breitinger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gipp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Evaluation of header metadata extraction approaches and tools for scientific PDF documents</article-title>
          .
          <source>In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries</source>
          . pp.
          <fpage>385</fpage>
          -
          <lpage>386</lpage>
          . ACM
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications</article-title>
          .
          <source>In: Proceedings of ECDL'09</source>
          . pp.
          <fpage>473</fpage>
          -
          <lpage>474</lpage>
          . Springer-Verlag, Berlin, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Mariani, J., Paroubek, P., Francopoulo, G., Hamon, O.: Rediscovering 15 + 2 years of discoveries in language resources and evaluation. Language Resources and Evaluation 50(2), 165-220 (2016)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Maynard, D., Tablan, V., Cunningham, H., Ursu, C., Saggion, H., Bontcheva, K., Wilks, Y.: Architectural elements of language engineering robustness. Natural Language Engineering 8(2-3), 257-274 (2002)</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Monaghan, F., Bordea, G., Samp, K., Buitelaar, P.: Exploring your research: Sprinkling some Saffron on Semantic Web Dog Food. In: Semantic Web Challenge at the ISWC. vol. 117, pp. 420-435 (2010)</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Moro, A., Cecconi, F., Navigli, R.: Multilingual word sense disambiguation and entity linking for everybody. In: Proceedings of the 2014 IISWC-PD Conference. pp. 25-28 (2014)</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. Osborne, F., Motta, E., Mulholland, P.: Exploring scholarly data with Rexplore. In: The Semantic Web - ISWC 2013. pp. 460-477 (2013)</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>23. Radev, D.R., Muthukrishnan, P., Qazvinian, V.: The ACL Anthology Network Corpus. In: NLPIR4DL 2009. pp. 54-61 (2009)</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>24. Ronzano, F., Saggion, H.: Dr. Inventor Framework: Extracting structured information from scientific publications. In: International Conference on Discovery Science 2015. pp. 209-220. Springer (2015)</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>25. Ronzano, F., Saggion, H.: Knowledge extraction and modeling from scientific publications. In: Proceedings of the Workshop on Semantics, Analytics, Visualisation: Enhancing Scholarly Data, co-located with the 25th International World Wide Web Conference (2016)</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>26. Roy, S.B., De Cock, M., Mandava, V., Savanna, S., Dalessandro, B., Perlich, C., Hamner, B.: The Microsoft Academic Search dataset and KDD Cup 2013. In: Proceedings of the 2013 KDD Cup Workshop. p. 1 (2013)</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>27. Saggion, H.: SUMMA: A robust and adaptable summarization tool. Traitement Automatique des Langues 49(2), 103-125 (2008)</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>28. Schäfer, U., Kiefer, B., Spurk, C., Steffen, J., Wang, R.: The ACL Anthology Searchbench. In: Proceedings of the 49th ACL. pp. 7-13 (2011)</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>29. Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.P., Wang, K.: An overview of Microsoft Academic Service (MAS) and applications. In: Proceedings of the 24th WWW Conference. pp. 243-246 (2015)</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>30. Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, L.: CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR) 18(4), 317-335 (2015)</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>31. Vahdati, S., Dimou, A., Lange, C., Di Iorio, A.: Semantic publishing challenge: bootstrapping a value chain for scientific data. In: International Workshop on Semantic, Analytics, Visualization. pp. 73-89 (2016)</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>32. Wade, A.D., Wang, K., Sun, Y., Gulli, A.: WSDM Cup 2016: Entity Ranking Challenge. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. pp. 593-594 (2016)</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>33. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2016)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>