Mining Scholarly Data for Fine-Grained
           Knowledge Graph Construction

    Davide Buscaldi1 , Danilo Dessı̀2 , Enrico Motta3 , Francesco Osborne3 , and
                           Diego Reforgiato Recupero2
        1
       LIPN, CNRS (UMR 7030), University Paris 13, Villetaneuse, France
                  davide.buscaldi@lipn.univ-paris13.fr
     2
       Computer Science Department, University of Cagliari, Cagliari (Italy)
                {danilo dessi, diego.reforgiato}@unica.it
3
  Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
              {enrico.motta, francesco.osborne}@open.ac.uk


        Abstract. Knowledge graphs (KG) are large networks of entities and
        relationships, typically expressed as RDF triples, relevant to a specific
        domain or an organization. Scientific Knowledge Graphs (SKGs) focus on
        the scholarly domain and typically contain metadata describing research
        publications such as authors, venues, organizations, research topics, and
        citations. The next big challenge in this field regards the generation of
        SKGs that also contain an explicit representation of the knowledge pre-
        sented in research publications. In this paper, we present a preliminary
        approach that uses a set of NLP and Deep Learning methods for ex-
        tracting entities and relationships from research publications, and then
        integrates them in a KG. More specifically, we i) tackle the challenge
        of knowledge extraction by employing several state-of-the-art Natural
        Language Processing and Text Mining tools, ii) describe an approach
        for integrating entities and relationships generated by these tools, iii)
        analyze an automatically generated Knowledge Graph including 10 425
        entities and 25 655 relationships derived from 12 007 publications in the
        field of Semantic Web, and iv) discuss some open problems that have not
        been solved yet.

        Keywords: Knowledge Graph · Semantic Web · Knowledge Extraction
        · Scholarly Data · Natural Language Processing


1     Introduction
Knowledge graphs (KG) are large networks of entities and relationships, usually
expressed as RDF triples, relevant to a specific domain or an organization [6].
Many state-of-the-art projects such as DBPedia [9], Google Knowledge Graph,
BabelNet, and YAGO build KGs by harvesting entities and relations from textual
resources, such as Wikipedia pages. The creation of such KGs is a complex
process that typically requires to extract and integrate various information from
structured and unstructured sources.
    Scientific Knowledge Graphs (SKGs) focus on the scholarly domain and typi-
cally contain metadata describing research publications such as authors, venues,
22        Buscaldi et al.

organizations, research topics, and citations. Good examples are Open Academic
Graph4 , Scholarlydata.org [15], and OpenCitations [17]. These resources provide
substantial benefits to researchers, companies, and policy makers by powering
several data-driven services for navigating, analyzing, and making sense of re-
search dynamics. One of their main limitations is that the content of scientific
papers is represented by unstructured texts (e.g., title and abstract). Therefore,
a significant open challenge in this field regards the generation of SKGs that
contain also an explicit representation of the knowledge presented in scientific
publications [2], and potentially describe entities such as approaches, claims, ap-
plications, data, and so on. The resulting KG would be able to support a new
generation of content-aware services for exploring the research environment at a
much more granular level.
    Most of the relevant information for populating such a KG might be derived
from the text of research publications. In the last year, we saw the emergence of
several excellent Natural Language Processing (NLP) and Deep Learning meth-
ods for entity linking and relationship extraction [12, 2, 8, 11, 10]. However, inte-
grating the outputs of these tools in a coherent KG is still an open challenge.
    In this paper, we present a preliminary approach that uses a set of NLP and
Deep Learning methods for extracting entities and relationships from research
publications and then integrates them in a KG. Within our work, we refer to
an entity as a linguistic expression that refers to an object (e.g., topics, tools
names, a well-know algorithm, etc.). We define a relation between two entities
when they are syntactically or semantically connected. As an example, if a tool
T adopts an algorithm A, we may build the relation (T , adopt, A).
    The main contributions of this paper are: i) a preliminary approach that
combines different tools for extracting entities and relations from research pub-
lications ii) an approach for integrating these entities and relationships, iii) a
qualitative analysis of a generated SKG in the field of Semantic Web, and iv) a
discussion of some open problems that have not been solved yet.

2      Related Work
In textual resources there are both syntactical and semantic peculiarities that
make hard the identification of entities and relations.
    In previous works, entities in textual resources were detected by studying
Part-Of-Speech (POS) tags. An example is constituted by [14], where authors
provided a graph based approach for Word Sense Disambiguation (WSD) and
Entity Linking (EL) named Babelfly. Later, some approaches started to exploit
various resources (e.g., context information and existing KGs) for developing en-
semble methodologies [11]. Following this idea, we exploited an ensemble of tools
to mine scientific publications and get the input data. Subsequently, we have de-
veloped our methodology on top of the ensemble result. Relations extraction is
an important task in order to connect entities of a KG.
    For doing so, authors in [8] developed a machine reader called FRED which
exploits Boxer [4] and links elements to various ontologies in order to represent
4
     https://www.openacademic.ai/oag/
      Mining Scholarly Data for Fine-Grained Knowledge Graph Construction         23

the content of a text in a RDF representation. Among its features FRED ex-
tracts relations between frames, events, concepts and entities5 . One more project
that enables the extraction of RDF triples from text is [3], where a framework
called PIKES has been designed to exploit the frame analysis to detect entities
and their relations. These works consider a single text at a time and do not
consider the type of text they parse. In contrast with them, our approach aims
at parsing specific type of textual data and, moreover, at combining information
from various textual resources. We decided to rely on open domain information
extraction tool results refined by contextual information of our data, adapting
open domain results on Scholarly Data. In addition, we combined entities and
relations coming from different scientific papers instead of mining a single text
at a time. With our approach the resulting KG represents the overall knowledge
presented in the input scientific publications.
    Recently, extraction of relations from scientific papers has also raised interest
within the SemEval 2018 Task 7 Semantic Relation Extraction and Classification
in Scientific Papers challenge [7], where participants had to face the problem
of detecting and classifying domain-specific semantic relations. An attempt to
build KGs from scholarly data was also performed by [10], as an evolution of their
work at SemEval 2018 Task 7. Authors proposed both a Deep Learning approach
to extract entities and relations, and then built a KG on a dataset of 110, 000
papers. Our work finds inspiration from it, but we used different strategies to
address open issues for combining entities and relations. In fact, authors of [10]
considered clusters of co-referenced entities to come up with a representative
entity in the cluster and solving ambiguity issues. On the contrary, we adopted
textual and statistics similarity to solve it.


3     The Proposed Approach
In this section, we describe the preliminary approach that we applied to produce
a KG of research entities. We used an input dataset composed by 12 007 abstracts
of scientific publications about the Semantic Web domain. It was retrieved by
selecting all publications from the Microsoft Academic Graph dataset which
contains in the string ”Semantic Web” in the ”field of science” heading.


3.1     Extraction of Entities and Relations

For extracting entities and relations, we exploited the following resources:

 – An extractor framework designed by [10], which is based on Deep Learning
   models and provides tools for detecting entities and relations from scientific
   literature. It detects six types of entities (Task, Method, Metric, Material,
   Other-Scientific-Term, and Generic) and seven types of relations among
   a list of predefined choices (Compare, Part-of, Conjunction, Evaluate-for,
   Feature-of, Used-for, Hyponym-Of ).
5
    http://wit.istc.cnr.it/stlab-tools/fred/
24     Buscaldi et al.

 – OpenIE [1] provided with Stanford Core NLP6 . It detects general entities
   and relations among them, especially those which can be derived by verbs.
 – The CSO Classifier [18], a tool for automatically classifying research papers
   according to the Computer Science Ontology (CSO)7 [19] which is a com-
   prehensive automatically generated ontology of research areas in the field of
   Computer Science.

    We processed each sentence from the abstract and used the three tools to
assign to each sentence si a list of entities Ei and a list of relations Ri . For
each sentence si , we firstly extracted entities and relations with the extractor
framework, and saved them in two lists (Ei and Ri , respectively). We discarded
all relations with type CONJUNCTION because they were too generic. Then,
we used CSO to extract all Computer Science topics from the sentence, further
expanding Ei . Finally, we processed each sentence si with OpenIE, and retrieved
all triples composed by subject, verb, and object in which both subject and
object were in the set of entities Ei .


3.2   Entities Refinement

Different entities in Ei may actually refer to the same concept with alternative
forms (e.g., machine learning, machine learning methods, machine-learning).
    In this section, we report the methodology used to merge these entities when
they appeared together in the same abstract.
    Cleaning up entities. First, we removed punctuation (e.g., dots and apos-
trophes) and stop-words (e.g., pronouns). Then we merged singular and plural
forms by using the WordNet Lemmatizer available in the NLTK8 library.
    Splitting entities. Some entities actually contained multiple compound ex-
pressions, e.g., machine learning and data mining. Therefore, we split entities
when they contained the conjunction and. Referring to our example, we obtained
the two entities machine learning and data mining.
    Handling Acronyms. Acronyms are usually defined, appearing the first
time near their extended form (e.g., Computer Science Ontology (CSO)) and
then by themselves in the rest of the abstract (e.g., CSO). In order to map
acronyms with their extended form in a specific abstract we use a regular ex-
pression. We then substituted every acronym (e.g., CSO) in the abstract with
their extended form (e.g., Computer Science Ontology).


3.3   Graph Generation

In order to generate the graph, we need to integrate all triples extracted from
the abstracts. In this phase we have to address three main issues. First, mul-
tiple entities derived from different abstracts may refer to the same concept.
6
  https://stanfordnlp.github.io/CoreNLP/
7
  http://cso.kmi.open.ac.uk
8
  https://www.nltk.org/
     Mining Scholarly Data for Fine-Grained Knowledge Graph Construction                 25

Secondly, multiple relationships derived from the verbs in the abstract may be
redundant (e.g., {emphasize, highlight, underline}), Finally, some entities may
be too generic (e.g., paper, approach) and thus useless for a SKG.


Entity Merging For the entity merging task we exploit two data structures.
The first one, labelled W 2LE, maps each word to a list of entities that share the
last token (e.g., medical ontology, biomedical ontology, pervasive agent ontology,
and so on.). With W 2LE we avoided comparing those entities that syntactically
could not refer to the same entity (e.g., the entities ontology generation and
ontology adoption were not compared). The second one, labelled E2E, maps
each original entity to the entity that will represent it in the KG.
       Given an entity e and the list of its tokens {t0 , ..., tn }, we took tn . If tn was
not present in W 2LE, a new entry key tn was added to W 2LE and its value
is a list with e as its unique element. If tn was in W 2LE, then we compute
the Levenshtein string similarity9 between the entity e and all other entities
e00 , ..., e0m ∈ W 2LE[tn ]. If the resulting score met a given threshold tL , the entity
e was mapped as e0i in E2E. Otherwise e was mapped to itself in E2E. At the
end, the entity e was added to W 2LE[tn ]. Finally, the map E2E was used to
select the entities for the graph. For each entry key ex , if its corresponding entity
ey = E2E[ex ] was not in the graph, a new entity with label ey was added.


Relationship Merging After selecting a unique set of entities, we need to take
care the relationships among them. First we cluster all verbs labels in order to
reduce their number. For such a reason, we exploited WordNet [13] and a set of
Word2Vec word embeddings trained on a set of 9 milion research papers from Mi-
crosoft Academic Graph10 . In details, given the set of all verbs V = {v0 , ..., vn },
we built a distance matrix M considering as a distance between two verbs vi
and vj the 1 − W uP almer11 similarity between their synsets. Then, we apply
a hierarchical clustering algorithm, cutting the dendrogram where the number
of clusters had the highest value of overall silhouette-width [5]. Subsequently,
clusters were refined as follows. Given a cluster c, we assigned each verb vic ∈ c
with the word embedding wi in the Word2Vec model, and computed the cen-
troid ce of the cluster as the average of word embeddings of its elements. Then,
we ordered verbs in ascending order by the distance from ce. All verbs with a
distance over a threshold t were discarded. All the other verbs were mapped on
the verb nearest to the centroid ce in a map V 2V .
    Finally, given each pair of entities p = (e1 , e2 ) and their relations {r0 , ..., rn },
we took every relation label lri ∀ri ∈ {r0 , ..., rn }. All relations label coming from
the extractor framework were directly merged into a single label L. All verb
labels were firstly mapped through V 2V and then merged.
9
   https://pypi.org/project/python-Levenshtein/
10
   Avaliable at http://tiny.cc/w0u43y
11
   http://www.nltk.org/howto/wordnet.html
26      Buscaldi et al.

3.4   Detection of Generic Entities

The resulting graph may contain several generic entities (e.g., content, time,
study, article, input, and so on.) In order to discard them we used a frequency-
based filter which detects generic terms by comparing the frequency of the enti-
ties in three set of publications:

 1. the set of 12 007 publications about the Semantic Web;
 2. a set of the same size covering Computer Science but not the Semantic Web;
 3. a set of the same size containing generic papers not about the Semantic Web
    nor the Computer Science.

    For each entity e, we computed the number of times it appeared in the above
datasets, so that we had three different counts c0 , c00 , c000 . Then we computed the
              0               0
ratios r0 = cc00 and r00 = cc000 . If the ratio r0 met a threshold t0 , and the ratio r00
met a threshold t00 the entity e was included in the graph.
    In addition, we automatically preserved all entities within a whitelist com-
posed by CSO topics and all the paper keywords in the initial dataset.


4     The Knowledge Graph
In this section, we report our preliminary results about the KG produced from
12 007 papers about the Semantic Web. We used the following parameters tL =
0.9, t0 = 2, and t00 = 3, which have been determined empirically. The resulting
KG has 10 425 entities and 25 655 relationships.
                    Table 1. Examples of relationships in the KG.

             Subject Entity               Relation            Object Entity
           content integration               help              linked data
           context reasoning               support        web ontology language
      machine readable information       PART-OF              semantic web
             semantic wikis              USED-FOR          query interpretation
         semantic relationship            establish       semantic link network
         semantic relationship            determine              wordnet


    Table 1 reports as example some relationships extracted by our framework.
The KG contains both verb-based relations (from OpenEI, in lowercase) and de-
fault relations (from the Extractor Framework, in uppercase). Verbs are usually
more informative, but also harder to extract. Conversely, the Extractor Frame-
work is more flexible and it is able to extract a large number of relationships, but
these are usually less specific. Using both systems allows us obtaining a good bal-
ance between coverage and specificity. Naturally, this set of relationships could
also be expanded by reasoning methods. For instance, the last two relationships
in Table 1 could be used to infer that wordnet is most likely a semantic link
network.
      Mining Scholarly Data for Fine-Grained Knowledge Graph Construction      27
      Table 2. Contribution of Extractor Framework and CSO to the KG entities.

             Tools Entities Contribution          Count       Percentage
                           CSO                     1 034          9.92%
                   Extractor Framework             8 668         83.15%
                      Exclusive CSO                 117           1.12%
             Exclusive Extractor Framework         7 751         74.35%
           Entities where both tools contribute     917           8.8%
                     Derived Entities              1 640         15.73%

4.1 Graph Statistics
In this section, we report some statistics about entities and relations of our KG.
Table 2 reports statistics about entities. To weigh the actual contribution of
each tool, we counted the number of entities that were detected by applying
each tool. With the label Exclusive we indicate the number of entities detected
only by that underlying tool. The row Derived Entities refers to the additional
entities that were obtained by merging or splitting the original entities. The
majority of entities that are present in the resulting KG comes from the Extractor
Framework tool which contributes to the 83.15% of all entities, and exclusively
contributes to 74.35% of them. The CSO Classifier contributes with 9.92%, but
only a minority are exclusive. This was expected, since CSO contains fairly
established research topics that appeared in a minimum of 50 papers in the
dataset from which it was generated [16]. Conversely, the Extractor Framework
is able to identify many long tail entities [12] that may only appear in few research
papers. It is worth nothing that in the final KG, 15.73% of all entities are different
from the original ones due to the transformations we applied. On average, each
entity was extracted 3.69 times by one of the tools, with a maximum of 52 and
a minimum of 1.

  Table 3. Contribution of Extractor Framework and OpenIE to the KG relations.

            Tools Relations Contribution          Count      Percentage
                  Extractor Framework             23 624        92.09%
                        OpenIE                     3 116        12.15%
             Exclusive Extractor Framework        22 539        87.85%
                   Exclusive OpenIE                2 031         7.92%
               Contribution of both tools          1 085         4.23%

   Similarly to entities, the Extractor Framework produced also the majority of
the relations with a coverage of 92.09%, 87.85% of which exclusive to this tool.
However, the 12.15% of relations extracted by OpenIE are usually more infor-
mative since they are mapped to specific verbs. On average, each relationship
was extracted 1.32 times, with a maximum of 54 and a minimum of 1.

4.2     Limitations
In this section, we analyze some key entities of the Semantic Web and highlight
some issues that still need to be solved to automatically produce high quality
28      Buscaldi et al.

SKGs. In order to focus on specific subsections of the KG, we extracted three
subgraphs containing all the entities directly linked to ontology, natural language
processing, and artificial intelligence. For the sake of space, in the following
figures we display only the most representative relationships between each pair
of entities, considering the following priority order: any verb extracted from
OpenEI, Used-for, Part-of, Feature-of, Hyponym-Of, Evaluate-for, Compare


                     (a)                                         (b)

Fig. 1. The subgraph of ontology. (a) A snippet where many entities related to ontology
are showed. (b) A snippet where relations between its nearest entities are showed.


    Figure 1 shows the subgraph of ontology, which is very dense since this con-
cept is very well represented in the input dataset. The ontology entity was cor-
rectly connected to several relevant entities as semantics, knowledge base, ontol-
ogy language, description logic and so on.
     The subgraph of the natural language processing entity is showed in Figure 2a.
It is less dense than that in Figure 1, since the natural language processing entity
is less represented in the input dataset. The subgraph highlights an important
issue that needs to be addressed. The entities natural language processing and nlp
were not merged. This problem is due to the fact that acronyms are managed at
abstract level, but not at graph level. Another issue regards the distinctive lack
of verb-based relations, which are often useful to better specify a relationship
between two entities.
   Similar considerations also apply to Figure 2b which shows the subgraph of
the artificial intelligence entity. Some relationships between significant entities
appear to be missing. For instance, machine learning and artificial intelligence
are not connected here because they were originally linked by the CONJUNC-
TION relations, which was able to detect entities listed together, but we dis-
carded since it is too generic. Another reason can be identified in textual forms
that our approach may not be able to detect. We thus need to improve our
pipeline to be able to handle similar instances and infer more specific relation-
ships.
     Mining Scholarly Data for Fine-Grained Knowledge Graph Construction            29


                    (a)                                         (b)


Fig. 2. The subgraph of (a) natural language processing and (b) artificial intelligence
.
5    Conclusion and Future Work
In this paper we described a preliminary workflow for producing a Scientific
Knowledge Graph from the text of research publication. We analysed a SKG
derived from a set of 12 007 publication in the field of Semantic Web, with the aim
of gaining a better understanding of the open problems that need to be solved
when addressing this task. In summary, the analysis presented in this paper
highlighted two main challenges. The first regards the disambiguation of entities
that need to be further improved by also considering their semantic similarity.
We also need to be able to resolve acronyms at a graph level by inferring to
which extended form a certain acronym refers to in a specific publication. This
may be addressed by representing entities according to word embedding learned
from the input data or relevant textual resources. However, this representations
would not consider long-tail entities that appear in few research papers. The
second challenge regards the specificity of the relationships. While the Extractor
Framework is quite good at extracting a large number of relationships, many of
them are too generic. We thus intend to experiment with other techniques that
combine Deep Learning and NLP for deriving specific predicates from research
publications. Furthermore, we aim at validating the SKGs by human experts
through a precision-recall analysis.


Acknowledgments
Danilo Dessı̀ acknowledges Sardinia Regional Government for the financial sup-
port of his PhD scholarship (P.O.R. Sardegna F.S.E. 2014-2020).

References

 1. Angeli, G., Premkumar, M.J.J., Manning, C.D.: Leveraging linguistic structure for
    open domain information extraction. In: Proceedings of the 53rd Annual Meeting
    of the ACL and the 7th IJCNLP. vol. 1, pp. 344–354 (2015)
30      Buscaldi et al.

 2. Auer, S., Kovtun, V., Prinz, M., Kasprzik, A., Stocker, M., Vidal, M.E.: Towards
    a knowledge graph for science. In: Proceedings of the 8th International Conference
    on Web Intelligence, Mining and Semantics. p. 1. ACM (2018)
 3. Corcoglioniti, F., Rospocher, M., Aprosio, A.P.: A 2-phase frame-based knowledge
    extraction framework. In: Proceedings of the 31st Annual ACM Symposium on
    Applied Computing. pp. 354–361. ACM (2016)
 4. Curran, J.R., Clark, S., Bos, J.: Linguistically motivated large-scale nlp with c&c
    and boxer. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive
    Poster and Demonstration Sessions. pp. 33–36 (2007)
 5. Dessı̀, D., Recupero, D.R., Fenu, G., Consoli, S.: A recommender system of medical
    reports leveraging cognitive computing and frame semantics. In: Machine Learning
    Paradigms, pp. 7–30. Springer (2019)
 6. Ehrlinger, L., Wöß, W.: Towards a definition of knowledge graphs. SEMANTiCS
    (Posters, Demos, SuCCESS) 48 (2016)
 7. Gábor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H.,
    Charnois, T.: Semeval-2018 task 7: Semantic relation extraction and classifica-
    tion in scientific papers. In: Proceedings of The 12th International Workshop on
    Semantic Evaluation. pp. 679–688 (2018)
 8. Gangemi, A., Presutti, V., Recupero, D.R., Nuzzolese, A., et al.: Semantic Web
    Machine Reading with FRED. Semantic Web 8(6), 873–893 (2017)
 9. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes,
    P.N., et al.: Dbpedia–a large-scale, multilingual knowledge base extracted from
    wikipedia. Semantic Web 6(2), 167–195 (2015)
10. Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of en-
    tities, relations, and coreference for scientific knowledge graph construction. In:
    Proceedings of the EMNLP 2018 Conference. pp. 3219–3232 (2018)
11. Martinez-Rodriguez, J.L., Lopez-Arevalo, I., Rios-Alvarado, A.B.: Openie-based
    approach for knowledge graph construction from text. Expert Systems with Ap-
    plications 113, 339–355 (2018)
12. Mesbah, S., Lofi, C., Torre, M.V., Bozzon, A., Houben, G.J.: Tse-ner: An iterative
    approach for long-tail entity extraction in scientific publications. In: ISWC. pp.
    127–143. Springer (2018)
13. Miller, G.A.: Wordnet: a lexical database for english. Communications of the ACM
    38(11), 39–41 (1995)
14. Moro, A., Raganato, A., Navigli, R.: Entity linking meets word sense disambigua-
    tion: a unified approach. Transactions of the Association for Computational Lin-
    guistics 2, 231–244 (2014)
15. Nuzzolese, A.G., Gentile, A.L., Presutti, V., Gangemi, A.: Conference linked data:
    the scholarlydata project. In: ISWC. pp. 150–158. Springer (2016)
16. Osborne, F., Motta, E.: Klink-2: integrating multiple web sources to generate se-
    mantic topic networks. In: ISWC. pp. 408–424. Springer (2015)
17. Peroni, S., Shotton, D., Vitali, F.: One year of the opencitations corpus. In: ISWC.
    pp. 184–192. Springer (2017)
18. Salatino, A.A., Thanapalasingam, T., Mannocci, A., Osborne, F., Motta, E.:
    Classifying research papers with the computer science ontology. In: ISWC
    (P&D/Industry/BlueSky). CEUR Workshop Proceedings. vol. 2180
19. Salatino, A.A., Thanapalasingam, T., Mannocci, A., Osborne, F., Motta, E.: The
    computer science ontology: a large-scale taxonomy of research areas. In: ISWC.
    pp. 187–205 (2018)