ASKG: An Approach to Enrich Scholarly Knowledge Graphs through Paper Decomposition with Deep Learning
Bowen Zhang, Sergio J. Rodríguez-Méndez and Pouya Ghiasnezhad Omran
Australian National University, Canberra ACT 2601, AU


Abstract
Knowledge Graphs (KGs) play a pivotal role in the field of artificial intelligence, yet the
construction of such graphs often requires significant human resources. Automated KG
construction technologies are key to achieving large-scale KG construction. To address this, we
have developed an automated Knowledge Graph Construction Pipeline (KGCP) and applied it to the
enhancement of the Australian National University (ANU) Scholarly Knowledge Graph (ASKG), which
comprehensively represents not only the metadata but also the scholarly knowledge encapsulated
in academic papers. This study introduces an innovative, automatic approach to KG construction
using an array of Natural Language Processing (NLP) techniques. Using Named Entity Recognition
(NER) models, the pipeline efficiently identifies key academic entities related to computer
science, such as Research Problems, Methods, Solution, Tool, Resource, Dataset, and Language.
The ASKG is further enriched through Named Entity Linking (NEL) with Wikidata, keyword
extraction, automatic summarisation, and the integration of entities from the Metadata
Extractor & Loader and The NLP-NER Toolkit (MEL & TNNT).


Keywords
Knowledge Graph, Named Entity Recognition, Named Entity Linking, Deep Learning, Information
Extraction, Knowledge Graph Construction


ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece
Email: Bowen.Zhang01@outlook.com (B. Zhang); Sergio.RodriguezMendez@anu.edu.au (S. J. Rodríguez-Méndez); P.G.Omran@anu.edu.au (P. G. Omran)
ORCID: 0000-0001-6045-8599 (B. Zhang); 0000-0001-7203-8399 (S. J. Rodríguez-Méndez); 0000-0002-4473-3877 (P. G. Omran)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


1. Introduction and Related Work
Academic KGs have been a focus in the field of cognitive intelligence. However, these KGs
often concentrate on high-level metadata of papers, such as the author, date, and venue, while
in-depth exploration of paper content is often overlooked. This limitation hinders the full
interpretation and utilisation of the detailed knowledge within academic papers.
   Addressing this issue is crucial, as it can guide deeper analysis, identify emerging academic
trends, reduce the hallucination problem of Large Language Models (LLMs), and enhance the
training outcomes of LLMs [1]. To tackle this, we implemented the PARSE (Papers And
Relationships Semantic Extraction) component within our broader KGCP project, which decomposes
academic papers and employs various NLP techniques and models for detailed knowledge
extraction.
   Numerous projects have been developed in the domain of academic KGs, such as AMiner [2],
AceKG [2], and MAKG [3], which aggregate extensive information on researchers, publications,
and citation relationships. However, their focus on fine-grained knowledge within papers is
often insufficient.
   ORKG [4] represents scholarly knowledge as structured data, but it lacks detailed content
analysis and fine-grained knowledge extraction. Other tools, like OpenAIRE [2] and ResearchRabbit
[5], focus on promoting open academic exchange and offer functionalities such as literature
search, personalised summaries, and visualisation.
   This paper presents an innovative approach to constructing KGs, emphasising the extraction
of fine-grained knowledge from scholarly papers to enrich ASKG. Unlike the above-mentioned
systems, our methodology entails section-wise parsing of academic papers that adhere to the
IMRaD (Introduction, Method, Results, and Discussion) structure, which many academic papers
essentially follow. For those that do not follow the IMRaD format, we
are in the process of implementing new tools and ontologies that can be customised according
to the specific structure of each paper. We employ NLP techniques such as NER, NEL, automatic
summarisation, and keyword extraction, individually applied to each IMRaD segment. This
strategy distinguishes ASKG from other KGs, offering a significant edge in
gathering and processing academic data. Detailed comparisons with these platforms and
tools can be found in our GitHub repository. Initial results suggest that our method significantly
enriches academic knowledge graphs, offering a more comprehensive and diverse data set and
illustrating the effectiveness of our decomposition and refinement approach in knowledge graph
construction.


2. KGCP Architecture: PARSE extension
Our ultimate goal is to expand academic KGs by automatically extracting fine-grained
knowledge through the structural decomposition of documents (research papers). To achieve
this, we are implementing and extending our KGCP¹ pipeline. As a key component of KGCP,
PARSE focuses specifically on enriching ASKG by extracting meaningful
knowledge from academic papers related to computer science.
   As shown in Figure 1, we first use web crawling to access ANU’s target sources (academic
web pages), MAKG, ScholarlyData, etc., automatically extracting information on researchers
and their papers to build an academic paper dataset. We subsequently generate JSON files
describing the paper metadata. PARSE operates in two primary phases. The first entails importing
papers into the MEL & TNNT systems [6][7], extracting metadata, raw text, and general entities
to enrich the existing ASKG.
   In the second phase, PARSE extracts scientific field-specific knowledge from academic papers.
Paper metadata described in JSON files is fed into a statistical analyser to obtain document set
metadata and identify computer science papers. Using the targeted paper list, we send HTTP
requests to the TNNT RESTful API, fetching the papers’ original text. PARSE then processes the
academic papers, segmenting them based on the IMRaD structure. We design and employ
transformer-based NER models with RoBERTa, SciBERT, LinkBERT, etc. The text is sent to
the NER module to identify computer science-related academic entities, which are categorised
as Research Problems, Methods, Solution, Tool, Resource, Dataset, and Language. Academic
entities from the NER module are linked with Wikidata entities to enhance our knowledge
graph. Meanwhile, we send different parts of the paper to the automatic summarisation model,
BRIO, to generate summaries, and to the keyword model, KeyBERT, for keyword identification.
All the outputs are processed to enrich the academic knowledge graph.

Figure 1: PARSE Structure

¹ See: https://w3id.org/kgcp/, especially https://w3id.org/kgcp/PARSE
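
   The IMRaD segmentation step can be illustrated with a minimal sketch. The paper does not
state how PARSE detects section boundaries, so the heading keywords, the heading pattern, and
the function names below are all assumptions made for illustration; short body lines that
happen to look like headings would be misclassified by this simple rule.

```python
import re

# Hypothetical heading keywords per IMRaD segment (an assumption, not PARSE's rules).
IMRAD_KEYWORDS = {
    "Introduction": ("introduction", "background"),
    "Method": ("method", "methods", "methodology", "approach"),
    "Results": ("results", "experiments", "evaluation"),
    "Discussion": ("discussion", "conclusion", "conclusions"),
}

# A heading is assumed to be a short line of letters, optionally numbered ("2. Methods").
HEADING_RE = re.compile(r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Za-z][A-Za-z ]{2,40})\s*$")


def classify_heading(line):
    """Return the IMRaD segment a heading line maps to, or None for body text."""
    match = HEADING_RE.match(line)
    if not match:
        return None
    words = match.group(1).lower().split()
    for segment, keywords in IMRAD_KEYWORDS.items():
        if any(word in keywords for word in words):
            return segment
    return None


def segment_imrad(text):
    """Group the body lines of a paper under the IMRaD segment they follow."""
    segments = {}
    current = None
    for line in text.splitlines():
        segment = classify_heading(line)
        if segment is not None:
            current = segment
            segments.setdefault(current, [])
        elif current is not None and line.strip():
            segments[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in segments.items()}


# Toy input, invented for this example.
sample_paper = """1. Introduction
Image captioning links vision and language.
2. Methods
We fine-tune a transformer encoder on annotated sentences.
3. Results
The model improves F1 on the held-out set.
4. Discussion
Errors concentrate on rare entity types.
"""
sample_segments = segment_imrad(sample_paper)
```

Each value in `sample_segments` could then be passed separately to the NER, summarisation, and
keyword models, which is the per-segment processing the section describes.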


3. Evaluation, Discussion, and Current Work

Table 1
Comparison between the original and enriched ASKG
     Metrics                             Original ASKG      Enriched ASKG      Change (%)
     Number of relation types                   21                 40            +90.48
     Number of entity types                     18                 46           +155.56
     Number of entities                      235,314           1,215,106        +416.38
     Number of triples                      1,048,576          2,866,980        +173.42
     Average degree                            4.46               4.72           +5.83
     Clustering coefficient                 6.28e-05            1.86e-04        +196.18
     Number of connected components              6                  1            -83.33
     Information density                       4.46               2.36           -47.09

   The comparison between the original and enriched ASKG in Table 1 shows significant
growth in the number of relation types, entity types, entities, and triples,
indicating enhanced structural diversity and information capacity. However, the information
density has decreased from 4.46 to 2.36, suggesting that the enriched ASKG has become sparser;
addressing this sparsity is a direction for future research.
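
   The Change (%) column of Table 1 can be reproduced with a short calculation. The sketch
below also checks an observation that is an inference rather than a definition given in the
paper: the reported information density values match the number of triples divided by the
number of entities. The function names are illustrative.

```python
def percent_change(original, enriched):
    """Relative change from the original to the enriched value, in percent."""
    return (enriched - original) / original * 100.0


def triples_per_entity(triples, entities):
    """Assumed reading of 'information density': triples divided by entities."""
    return triples / entities


# Values copied from Table 1.
original = {"relation_types": 21, "entity_types": 18,
            "entities": 235_314, "triples": 1_048_576}
enriched = {"relation_types": 40, "entity_types": 46,
            "entities": 1_215_106, "triples": 2_866_980}

changes = {key: round(percent_change(original[key], enriched[key]), 2)
           for key in original}

density_original = round(triples_per_entity(original["triples"], original["entities"]), 2)
density_enriched = round(triples_per_entity(enriched["triples"], enriched["entities"]), 2)

# The table's -47.09 is recovered when the change is taken over the rounded densities.
density_change = round(percent_change(density_original, density_enriched), 2)
```

Under this reading, the drop in density follows directly from entities growing faster
(+416.38%) than triples (+173.42%).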
   Listing 1 shows a portion of the output from PARSE. Unlike most traditional academic
knowledge graphs and the previous ASKG, the enriched ASKG includes not only high-level
metadata such as authors and publication dates, but also more detailed academic
information. This includes, but is not limited to, keywords and a summary for each
paper section, as well as more specific academic concepts such as the academic entities in
each sentence and their locations in the paper.

@prefix askg-data: <https://www.anu.edu.au/data/scholarly/> .
@prefix askg-onto: <https://www.anu.edu.au/onto/scholarly#> .
@prefix domo: <https://www.anu.edu.au/onto/domo#> .
......

askg-data:Paper-5003681fa6a914 a askg-onto:Paper ;
     rdfs:label "[SPICE: Semantic Propositional Image Caption Evaluation]-[Peter Anderson]-[2016]"@en ;
     askg-onto:hasSection askg-data:Abstract-1f35f04243f730 ,
             askg-data:Discussion-fc3bb8b300771b ,
             askg-data:Experiment-dc48c6d08186a7 ,
              ......
     askg-onto:paperLink "http://arxiv.org/abs/1607.08822v1"^^xsd:string .

askg-data:Abstract-1f35f04243f730 a askg-onto:Abstract ;
     rdfs:label "Paper-[SPICE: Semantic Propositional Image Caption Evaluation]-[Peter Anderson]-[2016]|Section-[Abstract]"@en ;
     domo:keyword askg-data:KeywordOfSection-0619f5fd0ab6a4 ,
     askg-onto:contains askg-data:Excerpt-ed1fc3dd5c08ab ,
     askg-onto:summary "SPICE: Semantic Propositional Image Caption is a new automated caption evaluation metric ......"^^xsd:string .

askg-data:Excerpt-ed1fc3dd5c08ab rdfs:label "Paper-['SPICE: Semantic Propositional Image Caption Evaluation']|Section-['Abstract']|Excerpt-[207]-[208]"@en ;
     askg-onto:inSentence "there is considerable interest in the task of generating automatically image captions image captions [1,2]"^^xsd:string ;
     askg-onto:mentions askg-data:AcademicEntity-image_caption-Q39161486 ;
     askg-onto:wordIndexFrom "207"^^xsd:int ;
     askg-onto:wordIndexTo "208"^^xsd:int .

askg-data:AcademicEntity-image_caption-Q39161486 rdfs:label "image caption"^^xsd:string ;
     owl:sameAs wd:Q39161486 ;
     skos:broader askg-onto:ResearchProblem .
......

Listing 1: Examples of PARSE output
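
   As a rough illustration of how an excerpt record like the one in Listing 1 could be
produced, the sketch below locates an entity mention by word position and prints the
corresponding triples. It is not the PARSE implementation: the helper names are hypothetical,
the indices are zero-based and inclusive by assumption (the paper does not state the
convention), and the sentence and excerpt identifier are toy values.

```python
def find_word_span(words, entity_words):
    """Return (first, last) inclusive word positions of the first match, or None."""
    size = len(entity_words)
    for start in range(len(words) - size + 1):
        if words[start:start + size] == entity_words:
            return start, start + size - 1
    return None


def turtle_string(text):
    """Escape backslashes and double quotes for a short, single-line Turtle literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def excerpt_turtle(excerpt_id, sentence, entity, span):
    """Serialise one excerpt using the predicates that appear in Listing 1."""
    first, last = span
    lines = [
        f"askg-data:Excerpt-{excerpt_id} askg-onto:inSentence "
        f"{turtle_string(sentence)}^^xsd:string ;",
        f"     askg-onto:mentions {entity} ;",
        f'     askg-onto:wordIndexFrom "{first}"^^xsd:int ;',
        f'     askg-onto:wordIndexTo "{last}"^^xsd:int .',
    ]
    return "\n".join(lines)


# Toy sentence and identifier, invented for this example.
toy_sentence = "there is considerable interest in the task of generating image captions"
toy_span = find_word_span(toy_sentence.split(), ["image", "captions"])
toy_record = excerpt_turtle(
    "example",
    toy_sentence,
    "askg-data:AcademicEntity-image_caption-Q39161486",
    toy_span,
)
```

In this toy case the positions are relative to a single sentence, whereas the values 207 and
208 in Listing 1 suggest that PARSE counts words over a larger unit such as the section.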
   With the enhanced ASKG, we propose a range of innovative use cases. One such use case
is knowledge graph-based research trend analysis, illustrated in Table 2. While our study
primarily focuses on capturing the dynamic evolution of academic research trends at the ANU,
the methodology is designed to be adaptable and can be applied to other institutions as well. By
executing SPARQL queries, we extract relevant data from the KGs and carry out a quantitative
analysis, identifying the most mentioned academic entities and research problems, which can
be interpreted as current research trends of the university’s academic sources.
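
   A query of this kind might look like the sketch below. The SPARQL text reuses only the
predicates visible in Listing 1, but the exact queries used for the analysis are not given in
the paper, so it should be read as an assumption. To keep the example self-contained, the same
counting logic is reproduced in plain Python over a handful of invented triples; the
`askg-onto:Tool` class name and the `ent:` and `ex:` identifiers are hypothetical.

```python
from collections import Counter

# Illustrative query text only; it is not executed in this sketch.
TREND_QUERY = """
PREFIX askg-onto: <https://www.anu.edu.au/onto/scholarly#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?label (COUNT(?excerpt) AS ?mentions)
WHERE {
  ?excerpt askg-onto:mentions ?entity .
  ?entity skos:broader askg-onto:ResearchProblem ;
          rdfs:label ?label .
}
GROUP BY ?label
ORDER BY DESC(?mentions)
"""


def count_research_problem_mentions(triples):
    """Count mentions per research problem label, most frequent first."""
    problems = {s for s, p, o in triples
                if p == "skos:broader" and o == "askg-onto:ResearchProblem"}
    labels = {s: o for s, p, o in triples if p == "rdfs:label"}
    counts = Counter(labels.get(o, o) for s, p, o in triples
                     if p == "askg-onto:mentions" and o in problems)
    return counts.most_common()


# Toy triples, invented for this example.
toy_triples = [
    ("ent:optical_flow", "skos:broader", "askg-onto:ResearchProblem"),
    ("ent:optical_flow", "rdfs:label", "optical flow"),
    ("ent:modal_logic", "skos:broader", "askg-onto:ResearchProblem"),
    ("ent:modal_logic", "rdfs:label", "modal logic"),
    ("ent:pytorch", "skos:broader", "askg-onto:Tool"),
    ("ent:pytorch", "rdfs:label", "PyTorch"),
    ("ex:1", "askg-onto:mentions", "ent:optical_flow"),
    ("ex:2", "askg-onto:mentions", "ent:optical_flow"),
    ("ex:3", "askg-onto:mentions", "ent:modal_logic"),
    ("ex:4", "askg-onto:mentions", "ent:pytorch"),
]
toy_trend = count_research_problem_mentions(toy_triples)
```

Restricting the count to entities whose `skos:broader` value is `askg-onto:ResearchProblem`
is what separates research problems from other entity types such as tools.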

     Rank   Research Problem       Frequency up to Jun. 2022   Frequency up to Dec. 2022   Rank Change
     1      Optical Flow                      230                         260                 +1 △
     2      Modal Logic                       231                         258                 -1 ▽
     3      Image Captioning                  144                         180                 +1 △
     4      Blur Kernel                       101                         173                 +1 △
     5      Action Recognition                168                         171                 -2 ▽
Table 2
Example of Research Trend Analysis with ASKG
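
   The Rank Change column of Table 2 can be checked by ranking the two frequency snapshots
and subtracting. The sketch below ranks only the five research problems shown in the table,
which is an assumption, since the full ranking in ASKG may contain more entries; the function
names are illustrative.

```python
def rank_by_frequency(frequencies):
    """Map each name to its rank (1 = most frequent)."""
    ordered = sorted(frequencies, key=frequencies.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_changes(before, after):
    """Positive values mean the item moved up between the two snapshots."""
    ranks_before = rank_by_frequency(before)
    ranks_after = rank_by_frequency(after)
    return {name: ranks_before[name] - ranks_after[name] for name in ranks_after}


# Frequencies copied from Table 2.
june_2022 = {"Optical Flow": 230, "Modal Logic": 231, "Image Captioning": 144,
             "Blur Kernel": 101, "Action Recognition": 168}
december_2022 = {"Optical Flow": 260, "Modal Logic": 258, "Image Captioning": 180,
                 "Blur Kernel": 173, "Action Recognition": 171}

table2_changes = rank_changes(june_2022, december_2022)
```

With these inputs the computed changes agree with the table: Action Recognition drops from
third to fifth, while Image Captioning and Blur Kernel each move up one place.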

   Research trend analysis can be used for academic performance management, resource
allocation, etc. Performing this level of refined analysis is challenging
within traditional academic KGs that only include paper metadata. This is mainly because the
metadata typically does not encompass in-depth descriptions of specific research problems
or other academic knowledge, which limits how deeply the dynamics
within a research field can be understood. In contrast, our enriched ASKG captures more
information, thereby facilitating more detailed trend analysis.
   Moreover, the enriched ASKG has a wider range of application scenarios, such as research
relationship mining. By integrating diverse data including authors, research interests, academic
entities, and summaries, it enables the discovery of overlooked patterns and potential cross-
disciplinary collaborations between researchers through graph mining.
   Currently, we are extending PARSE to other disciplines, such as astronomy and
physics. Simultaneously, we are developing an innovative semantic query processing system
(as an additional component of the KGCP) that combines LLMs with the enriched ASKG, aiming
to improve the efficiency of academic information queries and the accuracy of context-based
information retrieval from LLMs. In this system, user queries are translated into triple format
and then processed using SPARQL for graph matching, thereby supplying LLMs with more
accurate and complete academic information.
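
   The step from a triple to a SPARQL graph pattern might be sketched as follows. The query
processing system is described as work in progress and its design is not detailed, so the
function name, the use of `None` for unknown slots, and the generated query shape are all
assumptions; how a natural-language query is turned into a triple (for example, by an LLM) is
outside this sketch.

```python
# Prefixes taken from Listing 1.
PREFIXES = {
    "askg-data": "https://www.anu.edu.au/data/scholarly/",
    "askg-onto": "https://www.anu.edu.au/onto/scholarly#",
}


def triple_to_sparql(subject, predicate, obj, prefixes):
    """Build a query from one triple pattern; None marks a slot to be matched."""
    terms = []
    variables = []
    for name, value in (("s", subject), ("p", predicate), ("o", obj)):
        if value is None:
            variables.append("?" + name)
            terms.append("?" + name)
        else:
            terms.append(value)
    header = "\n".join(f"PREFIX {prefix}: <{iri}>" for prefix, iri in prefixes.items())
    pattern = " ".join(terms) + " ."
    if variables:
        body = "SELECT " + " ".join(variables) + "\nWHERE { " + pattern + " }"
    else:
        # A fully bound triple becomes a yes/no existence check.
        body = "ASK WHERE { " + pattern + " }"
    return header + "\n" + body


# Which excerpts mention the "image caption" entity from Listing 1?
example_query = triple_to_sparql(
    None,
    "askg-onto:mentions",
    "askg-data:AcademicEntity-image_caption-Q39161486",
    PREFIXES,
)
```

The bindings returned for `?s` would be the excerpts that could then be passed to an LLM as
supporting context.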
   We continue investigating and optimising the application of the LLMs and KGs in semantic
searches and KG construction-related tasks, further advancing the fields of information retrieval
and knowledge representation.


References
[1] F. Moiseev, Z. Dong, E. Alfonseca, M. Jaggi, SKILL: Structured Knowledge Infusion for
    Large Language Models, arXiv preprint arXiv:2205.08184 (2022).
[2] M. Nayyeri, G. M. Cil, S. Vahdati, F. Osborne, M. Rahman, S. Angioni, A. Salatino, D. R.
    Recupero, N. Vassilyeva, E. Motta, et al., Trans4E: Link prediction on scholarly knowledge
    graphs, Neurocomputing 461 (2021) 530–542.
[3] M. Färber, L. Ao, The Microsoft Academic Knowledge Graph enhanced: Author name
    disambiguation, publication classification, and embeddings, Quantitative Science Studies 3
    (2022) 51–98.
[4] M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D’Souza, G. Kismihók, M. Stocker, S. Auer,
    Open research knowledge graph: next generation infrastructure for semantic scholarly
    knowledge, in: Proceedings of the 10th International Conference on Knowledge Capture,
    2019, pp. 243–246.
[5] R. Sharma, S. Gulati, A. Kaur, A. Sinhababu, R. Chakravarty, Research discovery and
    visualization using ResearchRabbit: A use case of AI in libraries, COLLNET Journal of
    Scientometrics and Information Management 16 (2022) 215–237.
[6] S. J. Rodríguez Méndez, P. G. Omran, A. Haller, K. Taylor, MEL: Metadata Extractor &
    Loader, in: ISWC (Posters/Demos/Industry), 2021.
[7] S. Seneviratne, S. J. Rodríguez Méndez, X. Zhang, P. G. Omran, K. Taylor, A. Haller, TNNT:
    The Named Entity Recognition Toolkit, in: Proceedings of the 11th on Knowledge Capture
    Conference, 2021, pp. 249–252.