BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval

InTeReC: In-text Reference Corpus for Applying Natural Language Processing to Bibliometrics

Marc Bertin (1) and Iana Atanassova (2)
1 ELICO Laboratory, Université Claude Bernard Lyon 1, France, marc.bertin@univ-lyon1.fr
2 CRIT-Centre Tesnière, Université de Bourgogne Franche-Comté, France, iana.atanassova@univ-fcomte.fr

Abstract. Bibliometric research is increasingly interested in full-text processing and the study of the structure of scientific papers. The contexts of in-text references present in articles are particularly relevant for such studies. This work describes the construction of the InTeReC dataset, an in-text reference corpus that aims to promote experimental reproducibility and to provide a standard dataset for further research. The InTeReC dataset is a set of sentences containing in-text references, together with all the data necessary for their recontextualization in papers, in standard CSV format. It should encourage the implementation of natural language processing tools for bibliometric studies and related research in information retrieval and visualization.

Keywords: Bibliometrics, Citation Analysis, Citation Context Analysis, Information Analysis, Natural Language Processing, IMRaD

1 Introduction

The assumption that the contexts of the bibliographic references present in a scientific article play an important role in characterizing the relationship between citing works and cited works has been accepted for many decades. Publications are connected to each other by citations, and citation contexts categorize the semantic relations that exist between them. Whether the study of citation contexts relies mainly on linguistic clues or on machine learning techniques, citation contexts need to be extracted from scientific corpora for each research experiment. The extraction of citation contexts is thus a preliminary step to any statistical, distributional, syntactic or semantic analysis.

Sentences containing in-text references may carry relevant information about the cited research and the cited authors' research areas. Recently, He and Chen [12] proposed an approach to understanding citation contexts that characterizes the complex roles of a publication. Co-citation analysis has been widely used to study the intellectual structure of disciplines; however, recent work highlights the interest of taking into account the full text, and more precisely the paragraphs of papers [13]. Other approaches extract information from publications and add semantic attributes to traditional in-text references. For example, Parinov [17] focuses on papers' references, in-text references and citation contexts with the purpose of visualizing citation relationships, their semantic attributes and related statistics as annotations.

In this paper, we propose a large-scale dataset of citation contexts and explain the methods that were used for its construction. Other similar initiatives exist, with various objectives. For example, the ESWC-14 Challenge: Semantic Publishing – Assessing the Quality of Scientific Output (http://challenges.2014.eswc-conferences.org/index.php/SemPub, see [11]) focuses on the extraction and assessment of workshop proceedings. The recent research activity based on full text and the analysis of in-text references has led to a race in the size of datasets.
If we look at the sizes of the corpora used by the different actors of our community, we can see that they are very heterogeneous. The corpus for the CL-SciSumm task (http://wing.comp.nus.edu.sg/~cl-scisumm2018/) deals with automatic paper summarization in Computational Linguistics and is extracted from the ACL Anthology corpus and its citing papers [15]. In 2017, Hu et al. [14] worked with 350 articles from the Journal of Informetrics. In 2013, Ding et al. [10] analysed 866 articles from JASIST. The largest study of in-text reference distributions at the time was proposed in 2016 by Bertin et al. [7], who analysed 45,000 papers published in the PLOS journals. Recently, Boyack et al. [9] focused on the PubMed Central Open Access Subset and Elsevier journals, with five million full-text records for in-text reference analysis. We clearly observe an increase in the size of the textual data, but also a methodological evolution in the processing capabilities, advocating for larger datasets and the use of statistical tools. In general, the construction of a corpus is a heavy task that requires considerable skills and resources.

In this paper we describe the creation of the InTeReC dataset [6], a corpus of sentences containing in-text references extracted from papers published by PLOS. The dataset is available at https://zenodo.org/record/1203737. We will not detail or define here what a corpus is; for this, we refer the reader to, for example, Lüdeling and Kytö [16].

The aim of this paper is to present the method of construction of the InTeReC corpus, a standard in-text reference corpus that takes into account the different elements relevant to the implementation of experimental protocols. The overall objective is to facilitate citation context analyses and various distributional analyses by providing a large dataset to the community. The InTeReC dataset also serves the purposes of reproducibility, interoperability and cumulative research.

2 Method

The construction of the InTeReC dataset is based on several analyses and experiments carried out in recent years. In this section we summarize the methods that were used in order to propose a dataset that is reusable by the community for studying citation contexts.

Working with the full text of papers, we first classify the section titles in order to identify the four major section types in the IMRaD sequence (Introduction, Methods, Results and Discussion). This categorization aims to verify the coherence of the corpus with the IMRaD structure. In many articles, the four section types exist but not in the same order. For the InTeReC dataset, we focused only on papers that follow the IMRaD structure, i.e. papers that contain the four section types in the canonical order.

We then process the text content of all paragraphs and segment them into sentences. In our approach, sentences are considered as the basic textual units and are used to express the positions of references in the article and in the section. This approach makes it possible, for example, to assign relative positions to all references and to obtain the distribution of references along the text [7]. Finally, we count the number of references in each sentence. The InTeReC dataset contains only sentences that have one single in-text reference.
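As an illustration, the following Python sketch shows this kind of positional bookkeeping and single-reference filtering on a hypothetical list of (sentence text, reference count) pairs; the actual pipeline derives these counts from the JATS XML described in Section 2.1 below.

```python
# Minimal sketch, assuming sentences arrive as (text, reference_count) pairs;
# the real pipeline computes the counts from the XML markup (Section 2.4).
def single_reference_sentences(sentences):
    """Yield (position, relative position, text) for single-reference sentences."""
    total = len(sentences)
    for pos, (text, ref_count) in enumerate(sentences, start=1):
        if ref_count == 1:  # keep only sentences with exactly one in-text reference
            yield pos, pos / total, text

article = [
    ("Protein X regulates pathway Y [4].", 1),
    ("Several studies agree [2], [5], [7].", 3),
    ("We followed the protocol of [9].", 1),
]
for record in single_reference_sentences(article):
    print(record)  # first and third sentences, with their relative positions
```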
The links between the in-text references and the cited papers or bibliography items are preserved throughout the processing.

2.1 Data: source and structure

For this corpus, we have used the entire set of research articles published by PLOS up to September 2013. This initial corpus contains 90,071 articles. Founded in 2001, the Public Library of Science (PLOS) is an Open Access publisher of seven peer-reviewed academic journals, mostly in the fields of Biology and Medicine; PLOS ONE, the publisher's general journal, covers all fields of science and the social sciences. As these seven journals follow the same publication model but belong to different scientific fields, our aim is to observe the different uses of bibliographic references in these fields and their relation to the structure of the articles.

PLOS provides access to the articles in XML format. The set of XML elements and attributes used for the representation of journal articles is known as the Journal Article Tag Suite (JATS), an application of the ANSI/NISO Z39.96-2012 standard. Technology evolves quickly, and we have to take into consideration that JATS is a continuation of the NLM Archiving and Interchange DTD work by the NCBI (http://dtd.nlm.nih.gov). As this format is also used by PubMed, this work can easily be extended to processing the PubMed Open Access Subset, which is a larger dataset. The JATS structure of an article consists of three main elements: front, body and back; the textual content of the article is in the body element, which is further divided into sections and paragraphs. The front element contains some traditional metadata fields (title, authors, etc.) as well as the article type.

Different types of articles are present in the corpus, such as "Research article", "Synopsis", "Primer" and "Essay", and the typology is given in the article's metadata. We have focused on the "Research article" type, obtaining a total of 85,660 articles out of the initial 90,071 articles in the corpus.

2.2 Segmentation and section title processing

One of our objectives is to identify the rhetorical structure of the articles. The use of the IMRaD sequence (Introduction, Methods, Results and Discussion) is part of the editorial requirements of the PLOS journals, and the large majority of articles include these four sections. In some articles, however, the sections are not always in the same order. Sections are represented as separate elements in the original XML files. The research articles in the corpus contain a total of 404,311 sections. We categorized them automatically by analyzing the section titles in order to match the existing sections with one of the section types in the IMRaD structure [1,2]. In fact, variations exist in the ways authors choose to title the sections, e.g. the Methods section can have titles such as "Materials and Methods", "Method and Model", etc. We have constructed a set of regular expressions in order to classify the sections automatically. Table 1 presents some basic statistics on the result of this classification. The last two classes, (MR) and (RD), appear in some articles where one section merges two of the main section types, and thus the article contains only three main sections.
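The regular expressions themselves are not reproduced in the paper; the following Python sketch, with hypothetical patterns, illustrates how such a title classifier can work, including the merged classes (MR) and (RD).

```python
import re

# Minimal sketch with hypothetical patterns; the actual regular expressions
# used to build InTeReC are not published with the paper.
SECTION_PATTERNS = [
    ("I", re.compile(r"\bintroduction\b", re.I)),
    ("M", re.compile(r"\b(materials?|methods?)\b", re.I)),
    ("R", re.compile(r"\bresults?\b", re.I)),
    ("D", re.compile(r"\bdiscussion\b", re.I)),
]

def classify_section_title(title: str) -> str:
    """Map a section title to an IMRaD class, or '?' if unclassified."""
    classes = [label for label, pattern in SECTION_PATTERNS if pattern.search(title)]
    if len(classes) == 1:
        return classes[0]
    if classes == ["M", "R"]:  # one section merging Methods and Results
        return "(MR)"
    if classes == ["R", "D"]:  # one section merging Results and Discussion
        return "(RD)"
    return "?"

print(classify_section_title("Materials and Methods"))   # M
print(classify_section_title("Results and Discussion"))  # (RD)
```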
Class   Section type            Number of sections
I       Introduction            83,961
M       Methods                 84,006
R       Results                 76,909
D       Discussion              76,964
(MR)    Methods and Results     32
(RD)    Results and Discussion  7,072
Total                           328,944

Table 1: Classification of section titles according to the four section types of IMRaD

We further restricted the set of sentences to be included in the InTeReC dataset by selecting only sentences that have a single citation and that contain at least one occurrence of the most frequent verbs attested in citation contexts. These steps are explained in the following subsections.

Each paragraph was segmented into sentences by analyzing the punctuation of the text following a set of typographic rules. All occurrences of symbols denoting sentence boundaries (period, exclamation mark, etc.) were examined and disambiguated. In fact, the occurrence of a period in a text does not necessarily mark a sentence end, because in many cases it can be part of an abbreviation, a reference, a genus-species name, a numeric value, etc. We used a set of finite-state automata in order to determine the contexts in which periods signal sentence ends.

2.3 Article structures

Once we had classified the sections, we examined the sequence of sections present in each article. To produce the InTeReC dataset, we focused only on articles where the order of the four sections is "I,M,R,D". Taking merged sections into account, there are three possible article structures, which are listed in Table 2. The last two columns of this table give the total number of sentences in the articles and the number of sentences that contain at least one in-text reference.

Article structure   Articles   Sentences   Sentences with references
I,M,R,D             44,370     7,656,518   1,704,326
I,M,(RD)            2,971      504,246     113,237
I,(MR),D            28         5,300       937
Total               47,369     8,166,064   1,818,500

Table 2: Article structures following the IMRaD sequence

The following processing was done on these 47,369 articles, from which the InTeReC dataset was selected.

2.4 Reference processing

Our algorithm examines each sentence and counts the number of references present in the text. The input data is in XML format, where the references are represented as xref tags. Our algorithm covers all possible typographic variations of reference ranges and infers the missing data from the input XML. As a result, we obtain the list of sentences in the text, each associated with a reference count as well as a list of reference identifiers corresponding to the bibliography entries. We note that counting the xref tags is not a reliable method to obtain the reference counts, especially if one is interested in multiple in-text references (MIR) [5].

When in-text references are in a numeric form, reference ranges are often present in sentences containing MIR. For example, an in-text reference such as "[16]–[24]" is represented by two xref elements that point to the corresponding bibliography items, while in fact there are 9 different citations, 7 of which are not present in the XML markup. In order to correctly identify MIR and their number in sentences, it is important to detect in-text reference ranges.

For this first version of the InTeReC dataset, we have chosen to include only sentences that contain one single reference. These citation contexts establish links between only two works, the cited work and the citing article, and thus we can consider them as the simplest cases to study in terms of citation context analysis.
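Detecting ranges is necessary even to apply the single-reference filter reliably. As a minimal illustration, the sketch below expands numeric in-text reference ranges in plain sentence text; it is a deliberate simplification covering only the bracketed numeric style, whereas the actual algorithm works on the XML markup and handles further typographic variants.

```python
import re

# Minimal sketch of numeric reference-range expansion (e.g. "[16]-[24]"),
# one typographic variant only; the actual algorithm also infers the
# missing xref targets from the input XML, which is not shown here.
RANGE = re.compile(r"\[(\d+)\]\s*[-\u2013]\s*\[(\d+)\]")  # \u2013 is an en dash
SINGLE = re.compile(r"\[(\d+)\]")

def reference_ids(sentence: str) -> list[int]:
    """Return all cited bibliography numbers, expanding ranges."""
    ids = []
    for match in RANGE.finditer(sentence):
        ids.extend(range(int(match.group(1)), int(match.group(2)) + 1))
    remainder = RANGE.sub(" ", sentence)  # strip ranges before single refs
    ids.extend(int(m.group(1)) for m in SINGLE.finditer(remainder))
    return sorted(set(ids))

sentence = "Previous studies [16]-[24] and [3] report similar effects."
print(reference_ids(sentence))       # [3, 16, 17, ..., 24]
print(len(reference_ids(sentence)))  # 10 citations, not 3 xref elements
```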
2.5 Part-Of-Speech tagging and verb phrase extraction

A series of experiments has been published on verbs occurring in citation contexts and their distributions [3,4]. In general, verbs give important information about the nature of the relation between the article and the cited work. Polysemy is one possible problem when dealing with verbs, but in our case this phenomenon is reduced as we work specifically on citation contexts. Our hypothesis is that the semantics of the relation that exists between the cited work and the citing article is often expressed, to some extent, by the verb phrase in the sentence containing the in-text reference. For this reason, we examined the verb phrases that appear in sentences containing citations and included them in the dataset.

Bertin et al. [3,4] published lists of the most frequent verbs that appear in citation contexts with respect to section types in the IMRaD structure. We considered the most frequent verbs in each section, which are given in Table 3:

show, use, include, suggest, identify, find, require, associate, involve, lead, perform, follow, obtain, generate, base, determine, contain, calculate, carry, report, observe, express, see

Table 3: Verbs that appear in citation contexts: most frequent verbs in the different sections of the IMRaD structure [3,4]

All sentences were processed using the Part-Of-Speech tagger of the Python NLTK library (https://www.nltk.org) [8]. In the output, verb forms are tagged with labels such as VB, VBD, VBG, VBN, VBP and VBZ, which stand for base form, past tense, present participle, etc. We then identified verb phrases by producing parse trees using a simple grammar. The InTeReC dataset contains sentences with occurrences of the verbs in Table 3, for which the verb phrases have been identified. By keeping only sentences that contain these verbs, we eliminate many sentences that contain perfunctory citations, because they only mention the cited work without explicitly identifying its relation to the article. Thus we obtained the final set of 314,023 sentences for the dataset.

3 InTeReC dataset structure

The InTeReC dataset contains a list of sentences in full text. Information is given on the position of each sentence relative to the article and the section in which it appears, the section type with respect to the four main types of the IMRaD structure, as well as the verb phrases that occur in the sentence. Each sentence contains one single in-text reference. The dataset is published in the CSV format [6], with the following column list:

journal: journal title
doi: DOI of the article from which the sentence was extracted
article-length: size of the article, as number of sentences
article-pos: position of the sentence in the article, as number of sentences from the beginning of the article
section-length: size of the section, as number of sentences
section-pos: position of the sentence in the section, as number of sentences from the beginning of the section
section-type: section type (one of: I, M, R, D, MR, RD)
sentence-text: full text of the sentence
verb-phrases: a list of verb phrases that occur in the sentence, comma-separated

The format of the dataset has been chosen to facilitate the exploitation of the data and to make it compatible and easily reusable for most types of processing.
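As a minimal sketch of how the dataset can be consumed, the following Python snippet reads the CSV with the column names listed above; the local filename interec.csv is a hypothetical placeholder for a downloaded copy of the dataset.

```python
import csv

# Minimal sketch, assuming a local copy named "interec.csv" (hypothetical
# filename) with a header row containing the columns listed above.
with open("interec.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    # Keep only sentences extracted from Methods sections.
    methods_rows = [row for row in reader if row["section-type"] == "M"]

for row in methods_rows[:3]:
    print(row["doi"], row["verb-phrases"], row["sentence-text"][:60])
```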
4 Discussion and Conclusion

Although this corpus takes many aspects into account, it is not exhaustive and has characteristics that underline the inherent limitations of this approach. For example, we have limited this work to the level of the sentence, and thus do not take into account relevant citation contexts that span across sentence boundaries through the use of anaphora.

Many improvements are possible, and they are currently receiving our attention in the evolution of the corpus. The first concerns the quantitative aspect, with the extension of the corpus to new sources. For this, we can take into account, for example, PubMed (www.nlm.nih.gov/databases/download/pubmed_medline.html), arXiv (https://arxiv.org) and the CEUR Workshop Proceedings (http://ceur-ws.org). The second evolution concerns a more qualitative aspect of the dataset: adding semantic annotations. The main idea is to implement semantic annotation in order to provide a training dataset for supervised learning tools. For this purpose, we investigate tools to annotate segments with precision values close to those of human annotators. Indeed, one challenge would be proposing semantic annotations for ontologies such as CiTO (see [18]). Finally, the InTeReC dataset can be accessed and visualized using an R Shiny (https://shiny.rstudio.com) interface, in order to provide users with a way to interact with the data and observe distributional phenomena.

This dataset aims to facilitate the reproducibility of future research on in-text citation analysis and thus to provide a common foundation for the development of a unified model of citation context analysis.

5 Acknowledgments

We thank Benoit Macaluso of the Observatoire des Sciences et des Technologies (OST), Montreal, Canada, for harvesting and providing the PLOS dataset.

References

1. Atanassova, I., Bertin, M.: Semantic facets for scientific information retrieval. In: Semantic Web Evaluation Challenge: SemWebEval 2014 at ESWC 2014, pp. 108–113. Communications in Computer and Information Science, vol. 475. Springer, Anissaras, Crete, Greece (May 25-29, 2014)
2. Bertin, M., Atanassova, I.: Extraction and characterization of citations in scientific papers. In: Semantic Web Evaluation Challenge: SemWebEval 2014 at ESWC 2014, pp. 120–128. Communications in Computer and Information Science, vol. 475. Springer, Anissaras, Crete, Greece (May 25-29, 2014)
3. Bertin, M., Atanassova, I.: A study of lexical distribution in citation contexts through the IMRaD standard. In: Proceedings of the First Workshop on Bibliometric-enhanced Information Retrieval co-located with the 36th European Conference on Information Retrieval (ECIR 2014), vol. 1143, pp. 5–12. CEUR Workshop Proceedings, Amsterdam, The Netherlands (April 13, 2014)
4. Bertin, M., Atanassova, I.: Factorial correspondence analysis applied to citation contexts. In: Proceedings of the Second Workshop on Bibliometric-enhanced Information Retrieval co-located with the 37th European Conference on Information Retrieval (ECIR 2015). Vienna, Austria (March 29, 2015)
5. Bertin, M., Atanassova, I.: Multiple in-text reference aggregation phenomenon. In: Proceedings of the 3rd Workshop on Bibliometric-enhanced Information Retrieval co-located with the 38th European Conference on Information Retrieval (ECIR 2016), pp. 14–22. Padua, Italy (2016)
6. Bertin, M., Atanassova, I.: InTeReC: In-text Reference Corpus - Single References Dataset (March 2018), https://doi.org/10.5281/zenodo.1203737
7. Bertin, M., Atanassova, I., Gingras, Y., Larivière, V.: The invariant distribution of references in scientific articles. Journal of the Association for Information Science and Technology 67(1), 164–177 (2016), http://dx.doi.org/10.1002/asi.23367
8. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009)
9. Boyack, K.W., van Eck, N.J., Colavizza, G., Waltman, L.: Characterizing in-text citations in scientific articles: A large-scale analysis. Journal of Informetrics 12(1), 59–73 (2018), http://www.sciencedirect.com/science/article/pii/S1751157717303516
10. Ding, Y., Liu, X., Guo, C., Cronin, B.: The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics 7(3), 583–592 (2013), http://www.sciencedirect.com/science/article/pii/S1751157713000230
11. Dragoni, M., Solanki, M., Blomqvist, E.: Semantic Web Challenges: 4th SemWebEval Challenge at ESWC 2017, Portoroz, Slovenia, May 28 - June 1, 2017, Revised Selected Papers, vol. 769. Springer (2017)
12. He, J., Chen, C.: Understanding the changing roles of scientific publications via citation embeddings. arXiv preprint arXiv:1711.05822 (2017)
13. Hsiao, T.M., Chen, K.H.: Yet another method for author co-citation analysis: A new approach based on paragraph similarity. Proceedings of the Association for Information Science and Technology 54(1), 170–178 (2017), http://dx.doi.org/10.1002/pra2.2017.14505401019
14. Hu, Z., Lin, G., Sun, T., Hou, H.: Understanding multiply mentioned references. Journal of Informetrics 11(4), 948–958 (2017)
15. Jaidka, K., Chandrasekaran, M.K., Rustagi, S., Kan, M.Y.: Overview of the CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL 2016) (2016)
16. Lüdeling, A., Kytö, M.: Corpus Linguistics: An International Handbook. Walter de Gruyter (2008)
17. Parinov, S.: Semantic attributes for citation relationships: Creation and visualization. In: Garoufallou, E., Virkus, S., Siatri, R., Koutsomiha, D. (eds.) Metadata and Semantic Research, pp. 286–299. Springer International Publishing, Cham (2017)
18. Shotton, D.: CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics 1(Suppl 1), S6 (2010)