<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Editorial for the First Workshop on Mining Scientific Papers: Computational Linguistics and Bibliometrics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iana Atanassova</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Bertin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Mayr</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The open access movement in scientific publishing and search engines like Google Scholar
have made scientific articles more broadly accessible. During the last decade, the availability of
scientific papers in full text has become more and more widespread thanks to the growing
number of publications on online platforms such as arXiv and CiteSeer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The efforts to
provide articles in machine-readable formats and the rise of Open Access publishing have
resulted in a number of standardized formats for scientific papers (such as NLM-JATS, TEI,
DocBook), full text datasets for research experiments (PubMed, JSTOR, etc.) and corpora
(iSearch, etc.). At the same time, research in the field of Natural Language Processing has
provided a number of open source tools for versatile text processing (e.g. NLTK, Mallet,
OpenNLP, CoreNLP, GATE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], CiteSpace [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>
        Scientific papers are highly structured texts and display specific properties related not only to their
references but also to their argumentative and rhetorical structure. Recent research in this field has
concentrated on the construction of ontologies for citations and scientific articles (e.g. CiTO
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Linked Science [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and studies of the distribution of references [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, up to now
full-text mining efforts have rarely been used to provide data for bibliometric analyses. While
Bibliometrics traditionally relies on the analysis of metadata of scientific papers (see e.g. a
recent special issue on Combining Bibliometrics and Information Retrieval edited by Mayr &amp;
Scharnhorst, 2015 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), we will explore the roles that full-text processing of scientific papers and
linguistic analyses can play. With this workshop we would like to discuss novel approaches and
provide insights into scientific writing that can bring new perspectives for understanding both the
nature of citations and the nature of scientific articles. The possibility of enriching metadata through
the full-text processing of papers offers new fields of application to bibliometric studies.
Working with full text allows us to go beyond the metadata used in Bibliometrics. Full text offers
a new field of investigation, where the major problems concern the organization and
structure of text, the extraction of information and its representation at the level of metadata.
Furthermore, the study of contexts around in-text citations offers new perspectives related to
the semantic dimension of citations. The analyses of citation contexts and the semantic
categorization of publications will allow us to rethink co-citation networks, bibliographic
coupling and other bibliometric techniques.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Workshop outline</title>
      <p>The workshop “Mining Scientific Papers: Computational Linguistics and Bibliometrics”
(CLBib 2015), co-located with the 15th International Society of Scientometrics and
Informetrics Conference (ISSI 2015), brought together researchers in Bibliometrics and
Computational Linguistics in order to study the ways Bibliometrics can benefit from
large-scale text analytics and sense mining of scientific papers, thus exploring the interdisciplinarity
of Bibliometrics and Natural Language Processing (NLP). The goals of the workshop were to
answer questions like: How can we enhance author network analysis and Bibliometrics using
data obtained by text analytics? What insights can NLP provide on the structure of scientific
writing, on citation networks, and on in-text citation analysis?</p>
      <sec id="sec-2-1">
        <title>The workshop topics included the following:</title>
        <list list-type="bullet">
          <list-item><p>Linguistic modelling and discourse analysis for scientific texts</p></list-item>
          <list-item><p>User interfaces, text representations and visualizations</p></list-item>
          <list-item><p>Structure of scientific articles (discourse / argumentative / rhetorical / social)</p></list-item>
          <list-item><p>Scientific corpora and paper standards</p></list-item>
          <list-item><p>Act of citations, in-text citations and Content Citation Analysis</p></list-item>
          <list-item><p>Co-citation and bibliographic coupling</p></list-item>
          <list-item><p>Text-enhanced bibliographic coupling</p></list-item>
          <list-item><p>Terminology extraction</p></list-item>
          <list-item><p>Text mining and information extraction</p></list-item>
          <list-item><p>Scholarly information retrieval</p></list-item>
          <list-item><p>Ontological descriptions of scientific content</p></list-item>
          <list-item><p>Knowledge extraction</p></list-item>
        </list>
        <p>The call for papers attracted many contributions, showing strong interest in these topics in the
community. After a selection process, six papers were presented at the workshop. All papers
were peer-reviewed by at least two reviewers from the Program Committee. The
following section briefly outlines the papers that were presented.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Overview of the papers</title>
      <p>
        Natural language processing methods can be applied to scientific corpora in various domains
in order to obtain data and gain insights on the domain and synthesize the information present
in the documents. In the paper “NLP4NLP: Applying NLP to scientific corpora about written
and spoken language processing” by Gil Francopoulo, Joseph Mariani and Patrick
Paroubek [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors report the creation of a large corpus of papers in Natural Language
Processing and the processing of the corpus using methods and tools in the same field. The
creation of large-scale scientific corpora that are accessible in interoperable formats, such as
RDF, is an important step towards the development of NLP tools dedicated to scientific
writing.
      </p>
      <p>From the point of view of information extraction and text mining, the paper “Accurate
Keyphrase Extraction from Scientific Papers by Mining Linguistic Information” by Mounia
Haddoud, Aïcha Mokhtari, Thierry Lecroq and Saïd Abdeddaïm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposes a method
for keyphrase extraction from scientific papers using a supervised machine learning algorithm
with linguistically motivated features, and more specifically a noun-phrase filter. The authors
experiment with the SemEval-2010/Task-5 dataset and report that the use of linguistically
motivated features in the classifier improves the system’s ability to extract correct keyphrases
compared to other similar systems.</p>
      <p>
        Bilal Hayat, Muhammad Rafi, Arsal Jamal, Raja Sami Ur Rehman, Muhammad Bilal
Alam and Syed Muhammad Zubair Alam propose to classify citations as sentiment positive
or sentiment negative in the paper “Classification of Research Citations (CRC)” [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They
use a sentiment lexicon with Naïve Bayes classifiers for sentiment analysis. Their algorithm
is evaluated on a manually annotated and class-labelled collection of 150 research papers from
the domain of computer science, and preliminary results show an accuracy of 80%. They plan
to provide a web portal that assists scholars in automatically searching for and downloading
the papers citing a given seed paper, and in classifying these citing papers into sentiment
categories.
      </p>
      <p>
        In their paper “Using noun phrases extraction for the improvement of hybrid clustering with
text- and citation-based components. The example of ‘Information System Research’”, Bart
Thijs, Wolfgang Glänzel and Martin Meyer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] propose a document clustering method
that extracts noun phrases from abstracts and titles to improve the measurement of the
lexical component and uses bibliographic coupling for the citation component. They show
that, in the hybrid clustering approach, removing all single-term shingles provides the best results
in terms of computational feasibility, comparability with bibliographic coupling and performance in
a community detection application.
      </p>
      <p>
        The problem of automatic terminology extraction from scientific corpora is addressed in the
paper “The Termolator: Terminology Recognition based on Chunking, Statistical and
Search-based Scores” by Adam Meyers, Yifan He, Zachary Glass and Olga Babko-Malaya [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
They propose the Termolator system, which combines several different approaches to extract
chunks from texts. The authors examine several metrics for the ranking of the extracted terms
and analyse the effects of these metrics on a corpus of 5000 patents on a specific topic. They
report an accuracy of about 86% among the top 5000 terms.
      </p>
      <p>
        The paper “Towards Authorship Attribution for Bibliometrics using Stylometric Features” by
Andi Rexha, Stefan Klampfl, Mark Kröll and Roman Kern [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] describes a new text
segmentation algorithm to identify potential author changes within the main text of a
scientific article. Their approach captures stylistic changes in papers using
stylometric features such as type-token ratio, hapax legomena and average word length. A
preliminary evaluation on a small subset of PubMed shows that the more authors an article has,
the more potential author changes are identified. The authors also illustrate the style
changes among papers visually.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Outlook</title>
      <p>This first workshop raises some important questions around the interactions between
Bibliometrics and NLP. From an application point of view, adequate tools need to be
developed for the processing, information extraction and text mining of scientific corpora in
order to produce new metadata or populate ontologies in the context of the Semantic Web. Tools
for the efficient processing and conversion of PDF files are also necessary, making it possible to
produce corpora of scientific papers as structured documents, for example using XML
schemas like JATS (http://jats.nlm.nih.gov/) that can be directly analysed by natural language processing modules.
The main applications, beyond Bibliometrics, are in the field of the Semantic Web.</p>
      <p>Large-scale text analytics on scientific papers can also have an impact on the theory of
citation and contribute to defining the nature of scientific citations. Text-enhanced bibliometric
approaches can provide ground for the development of new linguistic models related to the
phenomenon of citation, which may lead to new breakthroughs in this field. This workshop is a first
step towards fostering reflection on this interdisciplinarity and on the benefits that the two disciplines,
Bibliometrics and Natural Language Processing, can derive from it.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Wu</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>H.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khabsa</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caragea</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ororbia</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Giles</surname>
            <given-names>C. L.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>CiteSeerX: AI in a Digital Library Search Engine</article-title>
          .
          <source>Innovative Applications of Artificial Intelligence, Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2014)</source>
          , pp.
          <fpage>2930</fpage>
          -
          <lpage>2937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tablan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>GATE: an architecture for development of robust HLT applications</article-title>
          .
          <source>In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology (JASIST)</source>
          ,
          <volume>57</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>359</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Shotton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>CiTO, the Citation Typing Ontology</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          ,
          <volume>1</volume>
          (
          <issue>Suppl 1</issue>
          ), S6. doi:
          <pub-id pub-id-type="doi">10.1186/2041-1480-1-S1-S6</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kauppinen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baglatzi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Keßler</surname>
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Linked Science: Interconnecting Scientific Assets</article-title>
          . In
          <string-name>
            <surname>Critchlow</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kleese-Van Dam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (Eds.):
          <source>Data Intensive Science</source>
          . CRC Press, USA
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Bertin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atanassova</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larivière</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gingras</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>The Invariant Distribution of References in Scientific Papers</article-title>
          .
          <source>Journal of the Association for Information Science and Technology (JASIST)</source>
          ,
          doi:
          <pub-id pub-id-type="doi">10.1002/asi.23367</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Mayr</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Scharnhorst</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Scientometrics and Information Retrieval - weak-links revitalized</article-title>
          .
          <source>Scientometrics</source>
          .
          <volume>102</volume>
          (
          <issue>3</issue>
          ),
          <fpage>2193</fpage>
          -
          <lpage>2199</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s11192-014-1484-3</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Francopoulo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Paroubek</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>NLP4NLP: Applying NLP to scientific corpora about written and spoken language processing</article-title>
          .
          <source>In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.org
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Haddoud</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mokhtari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lecroq</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Abdeddaïm</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information</article-title>
          .
          <source>In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.org
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hayat</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rafi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jamal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ur Rehman</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>S.M.Z.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Classification of Research Citations (CRC)</article-title>
          .
          <source>In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.org
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Thijs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glänzel</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Meyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Using noun phrases extraction for the improvement of hybrid clustering with text- and citation-based components. The example of “Information System Research”</article-title>
          .
          <source>In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.org
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Meyers</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glass</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Babko-Malaya</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>The Termolator: Terminology Recognition based on Chunking, Statistical and Search-based Scores</article-title>
          .
          <source>In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.org
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Rexha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kröll</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Towards Authorship Attribution for Bibliometrics using Stylometric Features</article-title>
          .
          <source>In Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.org
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>