<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Approach to Enrich Scholarly Knowledge Graphs through Paper Decomposition with Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bowen Zhang</string-name>
          <email>Bowen.Zhang01@outlook.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio J. Rodríguez-Méndez</string-name>
          <email>Sergio.RodriguezMendez@anu.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pouya Ghiasnezhad Omran</string-name>
          <email>P.G.Omran@anu.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Knowledge Graph, Named Entity Recognition, Name Entity Linking, Deep Learning, Information Extrac-</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Australian National University</institution>
          ,
          <addr-line>Canberra ACT 2601, AU</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Methods</institution>
          ,
          <addr-line>Solution, Tool, Resource, Dataset, and Language. The ASKG is further enriched through</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Knowledge Graphs (KGs) play a pivotal role in the field of artificial intelligence, yet constructing them often requires significant human resources. Automated KG construction technologies are key to building KGs at large scale. To address this, we have developed an automated Knowledge Graph Construction Pipeline (KGCP) and applied it to enhance the Australian National University (ANU) Scholarly Knowledge Graph (ASKG), which comprehensively represents not only the metadata but also the scholarly knowledge encapsulated in academic papers. This study introduces an innovative, automatic approach to KG construction using an array of Natural Language Processing (NLP) techniques. Leveraging Named Entity Recognition (NER) models, key academic entities related to computer science are efficiently identified, such as Research Problems, Methods, Solution, Tool, Resource, Dataset, and Language. The ASKG is further enriched through Named Entity Linking (NEL) with Wikidata, keyword extraction, automatic summarisation, and the integration of entities from the Metadata Extractor &amp; Loader and The NLP-NER Toolkit (MEL &amp; TNNT).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-3">
      <title>1. Introduction and Related Work</title>
      <p>Academic KGs have been a focus in the field of cognitive intelligence. However, these KGs
often concentrate on high-level metadata of papers, such as the author, date, venue, etc., while
the in-depth exploration of paper content is often overlooked. This limitation hinders the full
interpretation and utilisation of detailed knowledge within academic papers.</p>
      <p>
        Addressing this issue is crucial, as it can guide deeper analysis, identify emerging academic
trends, mitigate the hallucination problem of Large Language Models (LLMs), and enhance the
training outcomes of LLMs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To tackle this, we implemented the PARSE (Papers And
Relationships Semantic Extraction) component within our broader KGCP project, which decomposes
academic papers and employs various NLP techniques and models for detailed knowledge
extraction.
      </p>
      <p>
        Numerous projects have been developed in the domain of academic KGs, such as AMiner [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
AceKG [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and MAKG [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which aggregate extensive information on researchers, publications,
and citation relationships. However, their coverage of fine-grained knowledge within papers is
often insufficient.
      </p>
      <p>
        ORKG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] represents scholarly knowledge as structured data, but it lacks detailed content
analysis and fine-grained knowledge extraction. Other tools, such as OpenAIRE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and ResearchRabbit
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], focus on promoting open academic exchange and offer functionalities such as literature
search, personalised summaries, and visualisation.
      </p>
      <p>This paper presents an innovative approach to constructing KGs that emphasises the extraction
of fine-grained knowledge from scholarly papers to enrich ASKG. Unlike the above-mentioned
systems, our methodology parses academic papers section by section, following the
IMRaD (Introduction, Method, Results, and Discussion) structure to which many academic papers
adhere. For papers that do not follow the IMRaD format, we
are implementing new tools and ontologies that can be customised
to the specific structure of each paper. We apply NLP techniques such as NER, NEL, automatic
summarisation, and keyword extraction individually to each IMRaD segment. This
strategy distinguishes ASKG from other KGs and offers a significant edge in
gathering and processing academic data. Detailed comparisons with these platforms and
tools can be found in our GitHub repository. Initial results suggest that our method significantly
enriches academic knowledge graphs, offering a more comprehensive and diverse data set and
demonstrating the efficiency of our decomposition and refinement approach to knowledge graph
construction.</p>
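      <p>The section-wise parsing described above can be illustrated with a minimal sketch. It assumes that a paper is available as plain text with numbered headings, and it uses a hypothetical keyword table (<monospace>IMRAD_KEYWORDS</monospace>) to map headings to IMRaD segments. Both the table and the heading pattern are assumptions made for illustration and are not the rules used by PARSE.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical mapping from heading keywords to IMRaD segments (an assumption,
# not the heading rules used by PARSE).
IMRAD_KEYWORDS = {
    "Introduction": ("introduction", "background", "related work"),
    "Method": ("method", "approach", "architecture", "model"),
    "Results": ("result", "experiment", "evaluation"),
    "Discussion": ("discussion", "conclusion", "future work"),
}

# A numbered heading such as "3. Evaluation" or "2 Methods".
HEADING = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s+(?P<title>\S.*)$")


def classify_heading(title):
    """Return the IMRaD segment whose keywords match the heading, else None."""
    lowered = title.lower()
    for segment, keywords in IMRAD_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return segment
    return None


def split_imrad(text):
    """Group body lines under the most recent recognised IMRaD heading."""
    segments = {}
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        segment = classify_heading(match.group("title")) if match else None
        if segment is not None:
            # A recognised heading starts (or resumes) a segment.
            current = segment
            segments.setdefault(current, [])
        elif current is not None and line.strip():
            # Any other non-empty line is treated as body text.
            segments[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in segments.items()}
```
]]></preformat>
      <p>Keyword matching of this kind is only a starting point: numbered lines that are not headings, or headings that match no keyword, would need additional handling in practice.</p>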
    </sec>
    <sec id="sec-4">
      <title>2. KGCP Architecture: PARSE extension</title>
      <p>Our ultimate goal is to expand academic KGs by automatically extracting fine-grained
knowledge through the structural decomposition of documents (research papers). To achieve
this, we are implementing and extending our KGCP<sup>1</sup> pipeline. As a key component of KGCP,
PARSE focuses specifically on enriching ASKG by extracting meaningful
knowledge from academic papers related to computer science.</p>
      <p>
        As shown in Figure 1, we first use web crawling to access ANU’s target sources (academic
web pages), MAKG, ScholarlyData, etc., automatically extracting information on researchers
and their papers to build an academic paper dataset. We subsequently generate JSON files
describing paper metadata. PARSE operates in two primary phases. The first entails importing
papers into the MEL &amp; TNNT systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], extracting metadata, raw text, and general entities
to enrich the existing ASKG.
      </p>
      <p>In the second phase, PARSE extracts scientific field-specific knowledge from academic papers.
Paper metadata described in JSON files is fed into a statistical analyser to obtain document set
metadata and identify computer science papers. Using the targeted paper list, we send HTTP
requests to the TNNT RESTful API to fetch the papers’ original text. PARSE then processes the
academic papers, segmenting them based on the IMRaD structure. We design and employ
transformer-based NER models built on RoBERTa, SciBERT, LinkBERT, etc. The text is sent to
the NER module to identify computer science-related academic entities, which are categorised
as Research Problems, Methods, Solution, Tool, Resource, Dataset, and Language. Academic
entities from the NER module are linked with Wikidata entities to enhance our knowledge
graph. Meanwhile, we send different parts of the paper to the automatic summarisation model,
BRIO, to generate summaries, and to the keyword model, KeyBERT, for keyword identification.
All the outputs are processed to enrich the academic knowledge graph.</p>
      <p><sup>1</sup> See: https://w3id.org/kgcp/, especially https://w3id.org/kgcp/PARSE</p>
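      <p>The per-section flow of the second phase (NER, then NEL, alongside summarisation and keyword extraction) can be sketched as follows. This is a minimal illustration under stated assumptions: the four processing steps are passed in as plain callables, and the <monospace>toy_*</monospace> functions are hypothetical stand-ins for the transformer-based NER models, Wikidata linking, BRIO, and KeyBERT. None of the names below belong to PARSE or to those libraries.</p>
      <preformat><![CDATA[
```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# (mention, entity type, Wikidata QID or None)
Entity = Tuple[str, str, Optional[str]]


@dataclass
class SectionRecord:
    name: str
    summary: str
    keywords: List[str]
    entities: List[Entity]


def enrich_sections(
    sections: Dict[str, str],
    recognise: Callable[[str], List[Tuple[str, str]]],
    link: Callable[[str], Optional[str]],
    summarise: Callable[[str], str],
    extract_keywords: Callable[[str], List[str]],
) -> List[SectionRecord]:
    """Run NER, NEL, summarisation, and keyword extraction on each section."""
    records = []
    for name, text in sections.items():
        entities = [(mention, kind, link(mention)) for mention, kind in recognise(text)]
        records.append(SectionRecord(name, summarise(text), extract_keywords(text), entities))
    return records


# Toy stand-ins (assumptions) for the NER models, Wikidata linking, BRIO, and KeyBERT.
GAZETTEER = {"image caption": "Research Problems"}
QIDS = {"image caption": "Q39161486"}  # QID as it appears in Listing 1


def toy_recognise(text):
    lowered = text.lower()
    return [(term, kind) for term, kind in GAZETTEER.items() if term in lowered]


def toy_link(mention):
    return QIDS.get(mention)


def toy_summarise(text):
    # First sentence only; a real summariser would be abstractive.
    return text.split(". ")[0].rstrip(".") + "."


def toy_keywords(text):
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({word for word in words if len(word) > 6})[:3]
```
]]></preformat>
      <p>Passing the steps as callables keeps the orchestration independent of any particular model, so a stand-in can be swapped for a real model without changing the loop.</p>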
    </sec>
    <sec id="sec-5">
      <title>3. Evaluation, Discussion, and Current Work</title>
      <p>The comparison between the original and enriched ASKG, as shown in Table 1, reveals
significant growth in aspects such as the number of relation types, entity types, entities, and triples,
indicating enhanced structural diversity and information capacity. However, the information
density has decreased, suggesting that the enriched ASKG has become sparser, which opens a new
research direction.</p>
      <p>Listing 1 shows a portion of the output from PARSE. Unlike most traditional academic
knowledge graphs and the previous ASKG, the enriched ASKG not only includes high-level abstract
metadata such as authors and publication dates, but also contains more detailed academic
information. This information includes, but is not limited to, keywords and a summary in each
academic paper section, as well as more specific academic concepts like academic entities in
each sentence and their locations in the academic paper.</p>
      <preformat>@prefix askg-data: &lt;https://www.anu.edu.au/data/scholarly/&gt; .
@prefix askg-onto: &lt;https://www.anu.edu.au/onto/scholarly#&gt; .
@prefix domo: &lt;https://www.anu.edu.au/onto/domo#&gt; .
......
askg-data:Paper-5003681fa6a914 a askg-onto:Paper ;
    rdfs:label "[SPICE: Semantic Propositional Image Caption Evaluation]-[Peter Anderson]-[2016]"@en ;
    askg-onto:hasSection askg-data:Abstract-1f35f04243f730 ,
        askg-data:Discussion-fc3bb8b300771b ,
        askg-data:Experiment-dc48c6d08186a7 ,
        ......
    askg-onto:paperLink "http://arxiv.org/abs/1607.08822v1"^^xsd:string .
askg-data:Abstract-1f35f04243f730 a askg-onto:Abstract ;
    rdfs:label "Paper-[SPICE: Semantic Propositional Image Caption Evaluation]-[Peter Anderson]-[2016] | Section-[Abstract]"@en ;
    domo:keyword askg-data:KeywordOfSection-0619f5fd0ab6a4 ,
    askg-onto:contains askg-data:Excerpt-ed1fc3dd5c08ab ,
    askg-onto:summary "SPICE: Semantic Propositional Image Caption is a new automated caption evaluation metric ......"^^xsd:string .
askg-data:Excerpt-ed1fc3dd5c08ab rdfs:label "Paper-['SPICE: Semantic Propositional Image Caption Evaluation'] | Section-['Abstract'] | Excerpt-[207]-[208]"@en ;
    askg-onto:inSentence "there is considerable interest in the task of generating automatically image captions image captions [1, 2]"^^xsd:string ;
    askg-onto:mentions askg-data:AcademicEntity-image_caption-Q39161486 ;
    askg-onto:wordIndexFrom "207"^^xsd:int ;
    askg-onto:wordIndexTo "208"^^xsd:int .
askg-data:AcademicEntity-image_caption-Q39161486 rdfs:label "image caption"^^xsd:string ;
    owl:sameAs wd:Q39161486 ;
    skos:broader askg-onto:ResearchProblem .
......</preformat>
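      <p>To make the shape of the excerpt statements in Listing 1 concrete, the following sketch serialises one entity mention as Turtle text. It is an illustration only: the function names are hypothetical, the property and prefix names are copied from Listing 1, and the prefix declarations themselves are assumed to be emitted elsewhere. It does not describe how PARSE generates its output.</p>
      <preformat><![CDATA[
```python
def entity_iri(label, qid):
    """Build a prefixed entity name shaped like those in Listing 1."""
    return f"askg-data:AcademicEntity-{label.replace(' ', '_')}-{qid}"


def excerpt_to_turtle(excerpt_id, sentence, label, qid, word_from, word_to):
    """Serialise one entity mention as Turtle statements about an excerpt."""
    # Escape backslashes and double quotes so the literal stays valid Turtle.
    escaped = sentence.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        f"askg-data:Excerpt-{excerpt_id}",
        f'    askg-onto:inSentence "{escaped}"^^xsd:string ;',
        f"    askg-onto:mentions {entity_iri(label, qid)} ;",
        f'    askg-onto:wordIndexFrom "{word_from}"^^xsd:int ;',
        f'    askg-onto:wordIndexTo "{word_to}"^^xsd:int .',
    ])
```
]]></preformat>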
      <sec id="sec-5-1">
        <title>Listing 1: Examples of PARSE output</title>
        <p>With the enhanced ASKG, we propose a range of innovative use cases. One such use case
is knowledge graph-based research trend analysis, illustrated in Table 2. While our study
primarily focuses on capturing the dynamic evolution of academic research trends at the ANU,
the methodology is designed to be adaptable and can be applied to other institutions as well. By
executing SPARQL queries, we extract relevant data from the KGs and carry out a quantitative
analysis, identifying the most mentioned academic entities and research problems, which can
be interpreted as current research trends of the university’s academic sources.</p>
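        <p>The quantitative analysis described above can be sketched as a mention count per research problem. The SPARQL text below is a hypothetical query written against the property names visible in Listing 1 (<monospace>askg-onto:mentions</monospace>, <monospace>skos:broader</monospace>); it is not a query taken from the ASKG code base. The Python function computes the same aggregation over an in-memory list of triples, so the logic can be checked without a triple store.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical SPARQL query (an assumption based on the vocabulary in Listing 1).
TREND_QUERY = """
PREFIX askg-onto: <https://www.anu.edu.au/onto/scholarly#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?entity (COUNT(?excerpt) AS ?mentions)
WHERE {
  ?excerpt askg-onto:mentions ?entity .
  ?entity skos:broader askg-onto:ResearchProblem .
}
GROUP BY ?entity
ORDER BY DESC(?mentions)
"""


def count_research_problem_mentions(triples):
    """Pure-Python equivalent of TREND_QUERY over (subject, predicate, object) tuples."""
    problems = {
        s for s, p, o in triples
        if p == "skos:broader" and o == "askg-onto:ResearchProblem"
    }
    counts = Counter(
        o for s, p, o in triples
        if p == "askg-onto:mentions" and o in problems
    )
    # Most frequently mentioned research problems first.
    return counts.most_common()
```
]]></preformat>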
        <sec id="sec-5-1-5">
          <table-wrap id="tab-2">
            <label>Table 2</label>
            <table>
              <thead>
                <tr><th>Rank</th><th>Research Problem</th><th>Frequency up to Jun. 2022</th><th>Frequency up to Dec. 2022</th><th>Rank Change</th></tr>
              </thead>
              <tbody>
                <tr><td>1</td><td>Optical Flow</td><td>230</td><td></td><td></td></tr>
                <tr><td>2</td><td>Modal Logic</td><td>231</td><td></td><td></td></tr>
                <tr><td>3</td><td>Image Captioning</td><td>144</td><td></td><td></td></tr>
                <tr><td>4</td><td>Blur Kernel</td><td>101</td><td></td><td></td></tr>
                <tr><td>5</td><td>Action Recognition</td><td>168</td><td></td><td></td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Research trend analysis can be used for academic performance management, resource
allocation, etc. It is worth noting that performing this level of refined analysis is challenging
within traditional academic KGs that only include paper metadata. This is mainly because the
metadata typically does not encompass in-depth descriptions of specific research problems
or other academic knowledge, which limits the ability to understand the dynamics
within a research field in depth. In contrast, our enriched ASKG can capture more information, thereby
facilitating more detailed trend analysis.</p>
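          <p>A rank-change column of the kind used in the trend analysis can be derived from two frequency snapshots. The sketch below is a minimal illustration with hypothetical function names; it assumes each snapshot is a mapping from research problem to mention count, and it defines the change as the earlier rank minus the later rank, so a positive value means the problem moved up.</p>
          <preformat><![CDATA[
```python
def rank(frequencies):
    """Rank names by descending frequency; ties are broken alphabetically."""
    ordered = sorted(frequencies, key=lambda name: (-frequencies[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_changes(earlier, later):
    """Return earlier rank minus later rank for names present in both snapshots."""
    before, after = rank(earlier), rank(later)
    return {name: before[name] - after[name] for name in after if name in before}
```
]]></preformat>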
          <p>Moreover, the enriched ASKG has a wider range of application scenarios, such as research
relationship mining. By integrating diverse data including authors, research interests, academic
entities, and summaries, it enables the discovery of overlooked patterns and potential
cross-disciplinary collaborations between researchers through graph mining.</p>
          <p>Currently, we are extending PARSE to other disciplines, such as astronomy and
physics. Simultaneously, we are developing an innovative semantic query processing system
(as an additional component of the KGCP) that combines LLMs with the enriched ASKG, aiming
to improve the efficiency of academic information queries and the accuracy of context-based
information retrieval from LLMs. In this system, user queries are translated into triple formats
and then processed using SPARQL for graph matching, thereby supplying LLMs with more
accurate and complete academic information.</p>
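          <p>The step from a triple-formatted query to SPARQL can be sketched as follows. This is a hedged illustration, not the design of the system under development: it assumes that the user query has already been translated into a single (subject, predicate, object) pattern in which unknown positions are <monospace>None</monospace>, and the function name is hypothetical. Prefix declarations would need to be prepended before the query is sent to an endpoint.</p>
          <preformat><![CDATA[
```python
def triple_to_sparql(subject, predicate, obj):
    """Turn one triple pattern into a SPARQL query; None marks an unknown term."""
    terms, variables = [], []
    for name, value in (("s", subject), ("p", predicate), ("o", obj)):
        if value is None:
            # Unknown positions become variables that the query projects.
            variables.append(f"?{name}")
            terms.append(f"?{name}")
        else:
            terms.append(value)
    pattern = " ".join(terms)
    if not variables:
        # A fully specified triple is an existence check.
        return f"ASK {{ {pattern} . }}"
    projection = " ".join(variables)
    return f"SELECT {projection} WHERE {{ {pattern} . }}"
```
]]></preformat>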
          <p>We continue to investigate and optimise the application of LLMs and KGs in semantic
search and KG construction-related tasks, further advancing the fields of information retrieval
and knowledge representation.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moiseev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
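          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jaggi</surname>
          </string-name>
          ,
          <article-title>SKILL: Structured Knowledge Infusion for Large Language Models</article-title>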
          ,
          <source>arXiv preprint arXiv:2205.08184</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nayyeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Cil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vahdati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Angioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vassilyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          , et al.,
          <article-title>Trans4E: Link prediction on scholarly knowledge graphs</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>461</volume>
          (
          <year>2021</year>
          )
          <fpage>530</fpage>
          -
          <lpage>542</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <article-title>The Microsoft Academic Knowledge Graph enhanced: Author name disambiguation, publication classification, and embeddings</article-title>
          ,
          <source>Quantitative Science Studies</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>51</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Farfar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kismihók</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Knowledge Capture</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sinhababu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chakravarty</surname>
          </string-name>
          ,
          <article-title>Research discovery and visualization using ResearchRabbit: A use case of AI in libraries</article-title>
          ,
          <source>COLLNET Journal of Scientometrics and Information Management</source>
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <fpage>215</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Rodríguez Méndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Omran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>MEL: Metadata Extractor &amp; Loader</article-title>
          ,
          <source>in: ISWC (Posters/Demos/Industry)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Rodríguez Méndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Omran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haller</surname>
          </string-name>
          ,
          <article-title>TNNT: The Named Entity Recognition Toolkit</article-title>
          ,
          <source>in: Proceedings of the 11th on Knowledge Capture Conference</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>