<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>An Enterprise Knowledge Graph Approach</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Blerina Spahiu</string-name>
          <email>blerina.spahiu@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Lisa Gentile</string-name>
          <email>annalisa.gentile@ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chad DeLuca</string-name>
          <email>delucac@us.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Maurino</string-name>
          <email>andrea.maurino@unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research - Almaden Lab</institution>
          ,
          <addr-line>650 Harry Rd, San Jose, CA 95120</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Milano-Bicocca</institution>
          ,
          <addr-line>Viale Sarca 336, 20126, Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Big organizations have a complex ecosystem of entities: products, people, skills, and intellectual properties. Formally capturing, maintaining, and serving this knowledge is a complex challenge. Enterprise Knowledge Graphs (EKG) are an effective method to represent enterprise information in ways that can be more easily interpreted by both humans and machines. In this study, we concentrate on the EKG's section related to individuals' skills and expertise. We present a method to determine the topics that employees are knowledgeable about, using the text from their scholarly publications and patents. We use publicly available datasets on US patents and scholarly publications and apply Information Extraction techniques to extract skills from the text and represent them in the EKG format. The resulting EKG proves valuable for querying and analyzing employees' skills, helping to identify experts in specific domains.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge graph construction</kwd>
        <kwd>skills extraction</kwd>
        <kwd>scholarly data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) are powerful tools for organizing and representing information in a structured
and semantically rich manner [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition to the generic and open-world KGs such as DBpedia
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Wikidata [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], most current KGs are domain-specific, focusing on particular topics or areas
of interest [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] such as economy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], medicine [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], social science [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], etc. These KGs focusing on a
specific topic have a narrower scope, but they can be more detailed and accurate in the coverage of the
particular domain that they are developed to represent. Despite the numerous available KGs, there is
no “one-size-fits-all” solution, and it is often necessary to build, enhance, refine, and enrich domain-specific
knowledge graphs to serve specific use cases.
      </p>
      <p>
        Innovative organizations need to stay current with rapidly evolving research areas [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Access to tools
that offer up-to-date insights is invaluable for executives making strategic decisions and researchers
looking to improve the state of the art [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In addition, for organizations with thousands of employees
and diverse scientific disciplines, having a centralised resource to integrate and manage all needed
data is crucial. Our idea is that semantically representing and enriching each asset, project, scientific
publication, intellectual property - and eventually employee profile - can significantly enhance and
facilitate all downstream tasks related to skills discovery, trend analysis, expertise matching, and so on.
      </p>
      <p>
        The Semantic Web community has proposed various scholarly KGs like OAG [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], ORKG [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
OpenAlex [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], CS-KG [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and AIDA [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Notable ontologies for organizational resources and
      </p>
      <p>© 2024 This work is licensed under a “CC BY 4.0” license.</p>
      <p>However, such KGs and ontologies have some limitations related to (i) coverage - most do not include
information about patents; (ii) representation - none of them represents skills as a direct property of
employees; (iii) functionality - they do not offer advanced query capabilities.</p>
      <p>We create a Scholarly Enterprise Knowledge Graph (sEKG) containing an augmented representation
of internal assets (publications, patents, projects, etc) as well as publicly available external assets from
other research and innovation institutions. sEKG is effectively a collection of scholarly data (papers and
patents), integrated in a machine-readable format and enriched with external knowledge resources, to
enable efficient analysis and inference of implicit skills and organisational hierarchy. We bootstrap the
knowledge population task with standard information about each asset, i.e. type of asset (Intellectual
Property, Service, R&amp;D product, Scientific paper, etc.), names, textual descriptions, owners, etc. Then
we enrich each asset, extracting relevant concepts and linking them to external related ontologies
and Linked Data concepts. We extended the W3C Organisational Ontology to represent employees’
knowledge and department organisations. By leveraging the collected and enriched data, we can
infer the skills of individual authors and use this information in different use cases, including: (i)
identifying the right experts for specific projects, (ii) empowering skills growth among employees, (iii)
identifying external competitors, and (iv) supporting employers in performing daily tasks. The primary
contributions of this research encompass the following:
• Introduction of sEKG, the scholarly Enterprise Knowledge Graph that integrates scholarly and
patent data, and enriches such data with potential skills for each employee.
• Exploration of diverse sEKG application scenarios.
• Public release of our extended ontology, derived from the W3C Organisational Ontology, tailored
to meet the specific requirements of our sEKG.</p>
      <p>• Presentation of the outcomes of a user study, illustrating the utility and potential of sEKG.</p>
      <p>The remainder of this paper is structured as follows. After reviewing the related work in Section
2, we describe the sEKG creation in Section 3. Section 4 discusses potential application scenarios and
describes a small pilot study. We discuss lessons learned and the potential evolution of this work in
Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Scholarly Knowledge Graphs</title>
        <p>
          Several large knowledge graphs have been proposed in the scholarly field. Some cover a broad
range of entities such as publications, authors, and venues [
          <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
          ], while others focus on
a particular domain of research such as computer science [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], science [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], medicine [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], etc.
        </p>
        <p>
          The first considerable effort to offer comprehensive semantic descriptions of conference events is
represented by the metadata projects at the ESWC 2006 and ISWC 2006 conferences [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This project
generated the first version of the Semantic Web Conference Ontology6, which was later refactored
[18] and is still used to collect conference data7.
        </p>
        <p>
          Open Academic Graph8 (OAG) is a large knowledge graph unifying two billion-scale academic graphs:
Microsoft Academic Graph (MAG) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and AMiner [19]. As of July 2020, the snapshot of this KG
contains metadata for more than 239 million publications from all scientific disciplines, as well as over
1.38 billion references between publications, making it one of the largest freely available scholarly
knowledge graphs. OAG is built automatically by using machine learning algorithms and natural
language processing techniques to extract and link information from academic papers. OAG does not
provide APIs or tools for querying and analyzing data; it only provides links to download bulk data9.
6Semantic Web Conference Ontology http://data.semanticweb.org/ns/swc/swc_2009-05-09.html
7http://www.scholarlydata.org/
8https://www.microsoft.com/en-us/research/project/open-academic-graph/
9https://www.aminer.cn/oag-2-1
        </p>
        <p>
          Open Research Knowledge Graph10 (ORKG) is an open and collaborative platform that aims
to integrate research and academic knowledge in a structured way [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The ORKG contains not
only entities regarding research papers, authors, institutions, research topics, and concepts, but also
the relationships between them. Designed to be a community-driven platform that encourages the
sharing of research knowledge and facilitates collaboration among researchers, ORKG provides a search
functionality that allows users to explore the knowledge graphs and discover new research topics and
connections. Moreover, it provides APIs that enable developers to build applications that use different
aspects of the data it contains.
        </p>
        <p>
          OpenAlex11 is a heterogeneous directed graph, composed of five types of scholarly entities (authors,
institutions, concepts, publishers, and sources), and the connections between them. It includes more
than 248M works and it contains important identifiers including ORCID, ROR, ISSN, etc. OpenAlex data
can be used to build scholarly search engines, recommender services, or domain-specific knowledge
graphs. It can help manage research by tracking citation impact, spotting emerging areas, etc. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Domain-specific knowledge graphs for computer science, such as CS-KG, represent structured
information about the concepts, topics, entities, and relationships of the field
[
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ]. The Academia/Industry DynAmics (AIDA) Knowledge Graph describes 21M publications
and 8M patents according to the research topics drawn from the Computer Science Ontology12.
        </p>
        <p>Despite these continuous efforts, it has been argued that a great deal of information about academic
conferences is still missing or spread across several sources in a largely chaotic and non-structured way
[20]. Besides the problem of missing content, one of the other major challenges with scholarly data is
to ensure data quality, which means dealing with data-entry errors, disparate citation formats, lack
of (enforcement of) standards, imperfect citation-gathering software, ambiguous author names, and
abbreviations of publication venue titles [21].</p>
        <p>Although many generic or domain-specific scholarly KGs have been proposed in the literature,
they have several drawbacks with regard to the aim of this paper: (i) most KGs do not include patent
information, apart from AIDA, which itself does not offer rich query capabilities, such as filtering by
author name; (ii) none of the KGs include authors’ skills. Moreover, to the best of our knowledge, none
of the available literature showcases a real use-case scenario of such KGs.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Organizational and Skills Ontologies</title>
        <p>Organizational ontologies represent entities, relationships, and properties that are relevant to
organizations with the aim to facilitate the sharing and integration of organizational knowledge.</p>
        <p>The ORG Vocabulary and W3C Organization Ontology are two well-known ontologies to describe
the organisation of a company. The ORG ontology was developed as part of the data.gov.uk initiative and
is a small and generic ontology that aims to publish information about the organizational structure of
government institutions. It provides minimal basic terms to support representations of: (i) organizational
structure, (ii) reporting structure, (iii) location information, and (iv) organizational history (merger,
renaming, re-purposing). However, the ORG ontology is limited to some core base concepts and
does not provide category structures for organization type, organization purpose, or roles. The W3C
Organization Ontology instead is a vocabulary for describing organizational structures and relationships
within and between organizations. Its design allows domain-specific extensions that support additional
classification of organizations and roles, as well as extensions to support neighbouring information
such as organizational activities. Unlike ORG, which was built to represent
governmental institutions, the W3C Organizational Ontology can be used to represent a wide range
of organizational structures, including companies, government agencies, non-profit organizations,
and more. The ORG Vocabulary and the W3C Organization Ontology are closely related. In fact,
the ORG Vocabulary was developed as an extension of the W3C Organization Ontology, with the
goal of providing additional classes and properties that were specific to organizational structures and
relationships. Although both ontologies are used to describe organisational structures, the ORG
Vocabulary focuses more on the formal and legal aspects of organizations, while the W3C Organization
Ontology is more general and can be applied to a broader range of organizational types.
10https://orkg.org/
11https://docs.openalex.org/
12https://cso.kmi.open.ac.uk/schema/cso</p>
        <p>Schema.org vocabulary is a set of schemas used to structure web content in a semantically meaningful
way. It is used for marking up web pages in a way that search engines can easily understand the content
and provide more relevant results to users. It includes a wide range of types, such as products, recipes,
people, events, and more, with properties that describe their attributes and relationships.</p>
        <p>On the other hand, several ontologies have been proposed to represent the competences and skills of
an organisation’s employees. The European Skills, Competences, Qualifications and Occupations13 (ESCO)
ontology focuses on the EU labour market, describing skills and qualifications specific to the region.
It covers three different domains – the three “pillars” of ESCO: i) occupations, ii) knowledge, skills
and competences, and iii) qualifications [22]. The data model14 is based on the Simple Knowledge
Organization System (SKOS)15 ontology which is used for representing knowledge organization systems,
like thesauri, taxonomies and classification schemes. ESCO concepts are subclasses of SKOS concepts,
with some additional metadata properties to structure the ESCO pillars. ESCO defines more than 10 000
concepts using 24 EU languages.</p>
        <p>The Skills and Recruitment Ontology16 (SARO) is a domain ontology representing occupations, skills
and recruitment. Inspired by ESCO and Schema.org17, SARO covers four dimensions: job posts, skills,
qualifications, and users. It extends the ESCO SkillandQualification concept and introduces around
1000 concrete skill instances. However, in contrast with ESCO, SARO also describes the proficiency level
for each skill. Such an ontology has been evaluated on the TOBIE system that comprises processing
pipelines that extract the desired set of skills and job posting attributes and create a knowledge base
that can be used for analysing the skill demand in the labor market domain [23].</p>
        <p>While ESCO and SARO ontologies are rich resources for describing skills and competencies, they
go beyond the scope of this paper and beyond the requirements for our use cases. Instead, the W3C
Organisational Ontology and Schema.org provide a good starting point for our use cases, although we
needed to design a minimal extension of the W3C ontology to cover our application scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. sEKG construction</title>
      <p>sEKG is constructed in four steps: (i) dataset collection, (ii) skills extraction, (iii) ontology extension,
and (iv) knowledge graph population. The overall pipeline is depicted in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset collection</title>
        <p>The two sources of data that we consider for the construction of sEKG are patent and publication data.</p>
        <p>Patent data are downloaded from the US Patent and Trademark Office (USPTO)18. For the scope of
this paper, we collected all patents granted in the US since 2013. Patent data are made available by the
USPTO as bulk zip documents, containing XML descriptions of patents and the DTD that defines their
structure. To keep our system independent of the input data format, we transform each XML document
into a JSON representation model, only extracting the attributes needed for the scope of this work. At
the end of preprocessing, we have 3,566,517 US patents.</p>
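        <p>The XML-to-JSON step described above can be illustrated with a minimal sketch. The element names in the sample record, the field mappings, and the function name are assumptions made for illustration only; the actual USPTO bulk XML follows its own DTD, and the attribute set used in sEKG is not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified patent record; real USPTO bulk XML
# follows its own DTD with different element names and nesting.
SAMPLE_XML = """
<patent>
  <doc-number>0000001</doc-number>
  <title>Example title</title>
  <abstract>Example abstract text.</abstract>
  <inventors>
    <inventor>Jane Doe</inventor>
    <inventor>John Roe</inventor>
  </inventors>
  <year>2015</year>
  <drawings>ignored by the extractor</drawings>
</patent>
"""

# Assumed mapping from output JSON key to the XML path it is read from.
SINGLE_VALUED = {"id": "doc-number", "title": "title",
                 "abstract": "abstract", "year": "year"}
MULTI_VALUED = {"authors": "inventors/inventor"}


def patent_xml_to_record(xml_text):
    """Keep only the attributes listed in the two mappings above."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for key, path in SINGLE_VALUED.items():
        record[key] = (root.findtext(path) or "").strip()
    for key, path in MULTI_VALUED.items():
        record[key] = [(el.text or "").strip() for el in root.findall(path)]
    return record


record = patent_xml_to_record(SAMPLE_XML)
print(json.dumps(record, indent=2))
```
]]></preformat>
        <p>Keeping the mappings in plain dictionaries means that a change in the input format would only require new paths, not new extraction code, which is one way to keep the rest of a pipeline independent of the source format.</p>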
        <p>Scholarly data are metadata about the scientific publications by the company employees. Data include
title, authors, publication venue, keywords, abstract, and publication year. We limit our approach to
these attributes because they are typically available, even for non-open-access papers, allowing us to
generalize our method.
13https://esco.ec.europa.eu/en/use-esco/download
14https://ec.europa.eu/esco/lod/static/model.html
15https://www.w3.org/TR/2008/WD-skos-reference-20080829/skos.html
16https://elisasibarani.github.io/SARO/
17https://schema.org/
18https://www.uspto.gov/</p>
        <p>For each paper document, we enrich the initial metadata and produce a total of 21 attributes, adding
information such as author aliases, external IDs for papers and authors, topical categories,
etc. Each patent document has a total of 23 attributes, including the official USPTO categorization
(first, second, and third level category), abstract, author name, author affiliation, organization, claim,
description, publication year, etc. We align these attributes to ensure that the corresponding data from
both sources can be compared and analyzed efficiently.</p>
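        <p>As an illustration of what aligning attributes can mean in practice, the following sketch renames source-specific fields to a shared set of names. All field names here are hypothetical; the 21 paper attributes and 23 patent attributes mentioned above are not listed in this paper, so this is only a sketch of the idea under that assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical source-specific field names mapped onto shared names.
PAPER_FIELDS = {"paper_title": "title", "authors": "authors",
                "abstract": "abstract", "pub_year": "year"}
PATENT_FIELDS = {"invention_title": "title", "inventors": "authors",
                 "abstract": "abstract", "publication_year": "year"}


def align(record, field_map, kind):
    """Rename source fields to the shared names and tag the record type."""
    aligned = {common: record.get(source)
               for source, common in field_map.items()}
    aligned["kind"] = kind
    return aligned


paper = align({"paper_title": "A paper", "authors": ["Jane Doe"],
               "abstract": "Text.", "pub_year": 2020},
              PAPER_FIELDS, "paper")
patent = align({"invention_title": "A patent", "inventors": ["Jane Doe"],
                "abstract": "Text.", "publication_year": 2018},
               PATENT_FIELDS, "patent")

# Both records now expose the same keys and can be processed uniformly.
print(sorted(paper) == sorted(patent))
```
]]></preformat>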
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Skills extraction</title>
        <p>The ability to extract skills from scientific papers and patents can provide insights into trends and support
decision-making in different domains. Extracting skills from such resources might be challenging due
to the use of domain-specific jargon and the varying level of detail provided in the analyzed text. At
the time of this manuscript, we implemented two extractors: one based on Latent Dirichlet Allocation
(LDA), and one based on a Wikidata annotation API.</p>
        <p>LDA, an unsupervised generative probabilistic method for topic modeling [24], is applied in this
paper for skills extraction, inspired by [25]. The rationale behind this decision is that LDA assumes
that the words contributing most to a topic are conceptually related
and all refer to the same concept (skill or competency). LDA extracts the most relevant
words (concepts) from the text. The most common groups of words that co-occur can be interpreted as
skills, competencies, or experience.</p>
        <p>We pre-process the text of each patent/paper through standard tasks, including removal of HTML
tags, punctuation, digits, stop words, case normalization, and tokenization. Bi-grams are collected, and
part-of-speech tagging and lemmatization are applied. Bag-of-words representations are generated
using both CountVectorizer and TfidfVectorizer, popular techniques for converting text documents into
numerical formats suitable for machine learning algorithms like LDA. CountVectorizer creates a matrix
representing documents as a bag-of-words model, while TfidfVectorizer considers word frequency in
the corpus, assigning weights to each word based on its occurrence in the document and corpus. LDA
models are trained on document representations from CountVectorizer and TfidfVectorizer. Finally, the
top 5 skills are extracted for each document (abstract), and a list of all skills for each author is compiled
and stored in JSON documents.</p>
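        <p>The difference between the two document representations, and the final per-author compilation, can be sketched in plain Python. This is not the implementation used for sEKG: it hand-rolls raw counts and one common tf-idf variant (count times log(N / document frequency)) instead of calling CountVectorizer and TfidfVectorizer, it omits the LDA step and simply ranks terms by weight as a stand-in, and the documents, author names, and function names are toy assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import json
import math
from collections import Counter

# Toy, already pre-processed documents (hypothetical data).
DOCS = {
    "doc1": {"authors": ["alice"],
             "tokens": ["graph", "query", "graph", "data"]},
    "doc2": {"authors": ["alice", "bob"],
             "tokens": ["topic", "model", "data", "topic"]},
    "doc3": {"authors": ["bob"],
             "tokens": ["patent", "claim", "data", "patent"]},
}


def count_vectors(docs):
    """Raw term counts per document (the bag-of-words view)."""
    return {doc_id: Counter(doc["tokens"]) for doc_id, doc in docs.items()}


def tfidf_vectors(counts):
    """Reweight counts by log(N / document frequency)."""
    n_docs = len(counts)
    doc_freq = Counter(term for vec in counts.values() for term in vec)
    return {
        doc_id: {t: c * math.log(n_docs / doc_freq[t]) for t, c in vec.items()}
        for doc_id, vec in counts.items()
    }


def top_terms(vector, k):
    """k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def skills_per_author(docs, vectors, k):
    """Union of each author's per-document top terms, in first-seen order."""
    profile = {}
    for doc_id, doc in docs.items():
        for author in doc["authors"]:
            seen = profile.setdefault(author, [])
            for term in top_terms(vectors[doc_id], k):
                if term not in seen:
                    seen.append(term)
    return profile


counts = count_vectors(DOCS)
weights = tfidf_vectors(counts)
profiles = skills_per_author(DOCS, weights, k=2)
print(json.dumps(profiles, indent=2))
```
]]></preformat>
        <p>In this toy corpus the term "data" occurs in every document, so its tf-idf weight is zero while its raw count is not. This shows why the two representations can lead to different top-ranked terms for the same text.</p>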
        <p>The second approach for extracting skills involves using state-of-the-art tools for automatically
detecting named entities in free text and aligning them to a predefined knowledge base. Examples
of such tools are Spotlight [26], X-Lisa [27], Babelfy [28], and Wikifier [29]. Specifically, we selected
Wikifier [29], which produces links to Wikidata and Wikipedia. This approach benefits from the
extensive coverage of Wikidata/Wikipedia and can identify skills that may not have been explicitly
mentioned in the text but are related to the concepts discussed. As with LDA, the extracted
skills for each patent/paper are stored in sEKG.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ontology extension</title>
        <p>The Semantic Web’s potential hinges on ontology reuse, facilitating shared data understanding and
minimizing redundancy. This paper employs a recommended ontology development approach [30],
extending well-established ontologies like W3C Organisational Ontology and Schema.org. Utilizing
a bottom-up approach, we identified general requirements and expanded concepts and predicates to
cover specific use cases. The ontology, expressed in RDF (Figure 2), integrates Schema.org and W3C
Organisational Ontology properties and introduces new ones (orange) for scholarly and patent data,
maintaining compliance and enhancing data interoperability.</p>
        <p>The main types considered in the ontology are derived from Schema.org (schema namespace) to
represent scientific papers and patents, while the main types and properties from W3C Org (org
namespace) are used to represent the organisational organogram of the company. Among the most
frequently used types are:
• schema:CreativeWork is used to represent generic kind of creative work, including books, movies,
photographs, software programs, etc;
• org:Organisation is used to represent a collection of people organized together into a community
or other social, commercial, or political structure.
• sekg:Paper is a new type introduced in the ontology as a specialization of schema:CreativeWork
to represent scholarly data. Schema.org has types to represent journal papers, conference papers,
and also workshop papers. However, for the aim of this paper, we do not make any distinction
on the type of scholarly data, thus, we introduced a new generic one named Paper.
• sekg:Patent is a new type introduced in the ontology as a specialization of schema:CreativeWork
to represent patent data. Neither Schema.org, nor W3C Org ontologies provide this type to
represent patent data.
• sekg:Project is a new type to represent potential projects that an employee is working on.
Schema.org has a type to represent projects, called schema:Project; however, that type is a
subclass of Organisation and describes an enterprise (potentially individual but typically
collaborative) planned to achieve a particular aim. As the semantics of that type differ from the
one intended in this paper, we introduce the concept as new.
• sekg:Skills is a new type to represent the skills of each employee. Neither W3C Org nor
Schema.org defines a concept to represent the skills or competencies that a person might have. For this
reason, we introduced it as a new type, with the future intention of adding macro-categorizations
of skills.
• sekg:worksInProject is a new property used to represent and describe employees involved in a
particular project. Not only paper and patent data are important when extracting competencies
or skills of employees but also projects on which they are working.
• sekg:hasSkill is a new property used to represent and describe skills and competencies that an
employee has.
• All other properties describing paper and patent attributes are taken from Schema.org.</p>
        <p>Domain and range restrictions are introduced for properties where only one class/datatype was
specified as the value of the domainIncludes and rangeIncludes properties. Users willing to extend the
ontology can look at the recommended types specified in Schema.org in the annotation properties. All
data, and textual data in particular, are represented using Unicode UTF-8 character encoding to support
interoperability across languages at the alphabet level. The ontology is available for further extension
or improvement19.</p>
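        <p>To make the new types and properties more concrete, the following sketch emits a few Turtle triples for one employee. It is an illustration under stated assumptions: the sekg namespace IRI and the instance namespace are placeholders (the published ontology file defines the real one), the use of schema:author and schema:name to connect papers, people, and skill labels is an assumption, and the helper names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import json

PREFIXES = {
    "schema": "https://schema.org/",
    "sekg": "http://example.org/sekg#",  # hypothetical namespace IRI
    "ex": "http://example.org/data/",    # hypothetical instance namespace
}


def literal(text):
    """Quote a plain string as a Turtle literal (JSON escaping is reused)."""
    return json.dumps(text)


def employee_triples(emp_id, name, paper_id, skills):
    """Triples for one employee, one authored paper, and the inferred skills."""
    emp, paper = f"ex:{emp_id}", f"ex:{paper_id}"
    triples = [
        (emp, "a", "schema:Person"),
        (emp, "schema:name", literal(name)),
        (paper, "a", "sekg:Paper"),
        (paper, "schema:author", emp),
    ]
    for skill in skills:
        node = "ex:skill_" + skill.lower().replace(" ", "_")
        triples += [
            (node, "a", "sekg:Skills"),
            (node, "schema:name", literal(skill)),
            (emp, "sekg:hasSkill", node),
        ]
    return triples


def to_turtle(triples):
    """Serialize prefix declarations followed by one triple per line."""
    header = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    body = [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(header + [""] + body)


ttl = to_turtle(employee_triples("emp1", "Jane Doe", "paper1",
                                 ["Topic modeling"]))
print(ttl)
```
]]></preformat>
        <p>String building is used here only to keep the sketch free of dependencies; an RDF library would normally handle escaping and validation.</p>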
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Knowledge Graph Population</title>
        <p>All collected and generated data about papers and patents is collated in an Elasticsearch index. Separately,
we use the sEKG ontology to describe a subset of the company employees, for the pilot study. Data
about such employees is retrieved from the underlying index and served internally via a REST API.</p>
        <p>Table 1 reports examples of interactions (queries/responses) that the sEKG-mediated API can support.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Application Scenarios and Pilot Study</title>
      <p>We highlight the usage of sEKG for several application scenarios, including but not limited to Expert
Finding and Literature Review.</p>
      <p>Expert Finding is a key challenge in large companies focused on innovation and technological progress
- finding colleagues who can collaborate to provide feedback or guidance on cutting-edge projects and
research. This is especially true when receiving requests from external customers, coming with specific
needs and requirements: sometimes it is sufficient to match their requests to existing products/services
that the company provides [31], but other times the problem is novel and we want to initiate a research
effort - and identify researchers that have the correct expertise. sEKG can be used to retrieve areas
of expertise for each employee, but also given a particular “skill”, it can identify employees that have
exhibited that skill in any of their artifacts.
19https://anonymous.4open.science/r/sEKGOntology-37EA/sekg.ttl</p>
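      <p>The two lookups described above, from an employee to skills and from a skill to employees, can be sketched with an in-memory inverted index. This stands in for the Elasticsearch index and REST API of Section 3.4, whose schema and endpoints are not described here; the profiles and function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# Hypothetical employee -> skills profiles, e.g. compiled from papers
# and patents by a skills extractor.
PROFILES = {
    "alice": ["Knowledge graph", "Topic model"],
    "bob": ["Topic model", "Patent analysis"],
    "carol": ["Knowledge graph"],
}


def build_skill_index(profiles):
    """Invert the profiles: lower-cased skill -> set of employees."""
    index = defaultdict(set)
    for employee, skills in profiles.items():
        for skill in skills:
            index[skill.lower()].add(employee)
    return index


def find_experts(index, skill):
    """Employees who exhibited the skill, sorted for stable output."""
    return sorted(index.get(skill.lower(), set()))


def skills_of(profiles, employee):
    """Areas of expertise recorded for one employee."""
    return list(profiles.get(employee, []))


skill_index = build_skill_index(PROFILES)
print(find_experts(skill_index, "topic model"))
print(skills_of(PROFILES, "carol"))
```
]]></preformat>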
      <p>Reviewing the state of the art is also a key activity for innovative enterprises. In the process of
submitting a patentable model or algorithm, an employee wants to be sure they have reviewed all of the
state of the art. When reviewing available patents this can be especially challenging, both because of
the specific language used and because available search tools - e.g. those provided by USPTO20 - can be
limited to mere keyword search.</p>
      <p>We conducted a pilot study of the effectiveness of sEKG at inferring skills for a subset of employees
of a big company. The goal of this evaluation is to determine whether the extracted skills were accurate
by gathering feedback from the employees. We recruited 7 volunteers to whom we gave the set of
skills in sEKG obtained with the two extractors defined in Section 3.2. Almost 67% of skills obtained
by linking to Wikidata were considered correct by the users. By contrast, 44% of the skills
extracted by applying LDA as a baseline were considered correct. The errors fall into two categories. For the Wikidata
skills, some of the extracted concepts - while being relevant for the user - are not necessarily skills or
fields of expertise. For the skills extracted via LDA, we often have some very generic keywords. For
example, “dataset” could be a skill if it refers to data analysis, but not if it is simply a reference to a
data file mentioned in a text. We analyzed user feedback, which indicated overall satisfaction with the
extracted skills from both approaches. Users particularly appreciated the accuracy of skills associated
with Wikidata. This study revealed a limitation of the method: difficulty in accurately attributing skills
when papers or patents involve multiple authors. For instance, collaboration on diverse projects could
result in inaccurate skill assignments, particularly when extracting information solely from text. We
plan to mitigate the issue either with a human-in-the-loop extractor, based on dictionary expansion
techniques, similarly to [32] to allow each user to refine their skills in the sEKG, or by integrating
CRediT21 taxonomy. Additionally, inaccuracies in skill inference can also result from poor text quality.
We could observe that there exist several cases where text is very short, and contains typos, incomplete
sentences, or irrelevant content that could lead to the extraction of non-skills or irrelevant keywords.</p>
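      <p>The per-extractor percentages reported above are shares of extracted skills that volunteers judged correct. A minimal sketch of that arithmetic follows; the judgments are toy values chosen for illustration and are not the data collected in the pilot study, and the record layout is an assumption.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# Toy judgments: (extractor, skill, judged correct by the employee).
# These are illustrative values, not the pilot study data.
FEEDBACK = [
    ("wikidata", "knowledge graph", True),
    ("wikidata", "conference", False),
    ("wikidata", "information extraction", True),
    ("lda", "dataset", False),
    ("lda", "topic model", True),
    ("lda", "method", False),
]


def share_correct(feedback):
    """Fraction of judged skills marked correct, per extractor."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for extractor, _skill, ok in feedback:
        totals[extractor] += 1
        correct[extractor] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


shares = share_correct(FEEDBACK)
for name, value in shares.items():
    print(f"{name}: {value:.0%} of judged skills marked correct")
```
]]></preformat>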
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this work, we presented the scholarly enterprise knowledge graph (sEKG), a KG constructed from patents
and scientific papers. For its construction, only English-language papers and patents
from the USPTO were considered. The data from both sources are in JSON format and comprise different attributes. In
the first step, we aligned these attributes to ensure that the corresponding data from both sources could
be compared and analyzed efficiently. To enrich the data and give it semantics, we used and
extended the W3C Org ontology. The ontology was extended with concepts and properties from
Schema.org, and new concepts and properties were defined for portions of the data not covered by
existing ontologies or when their use in the present study was different. We depicted a few use cases and
ran a small pilot study to assess the feasibility and utility of sEKG. We mined skills from paper and
patent text using two different approaches, one based on classical NLP techniques and one on semantic
concept extraction and linking to Wikidata. A user evaluation involving seven volunteers indicated that
both methods were effective in extracting skills, but users preferred the Wikidata annotation approach
due to its higher accuracy.</p>
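      <p>As an illustration of the attribute alignment step summarized above, the following is a minimal sketch and not the pipeline used in this work. All attribute names (title, invention_title, authors, inventors, grant_year) and the shared target schema are hypothetical assumptions chosen for the example; the actual JSON attributes of the two sources may differ.</p>

```python
import json

# Hypothetical mappings from source-specific attribute names to a shared schema.
# These names are assumptions for illustration, not the real dataset fields.
PAPER_FIELDS = {
    "title": "title",
    "abstract": "text",
    "authors": "contributors",
    "year": "year",
}
PATENT_FIELDS = {
    "invention_title": "title",
    "abstract": "text",
    "inventors": "contributors",
    "grant_year": "year",
}


def align(record, mapping, source):
    """Rename a record's attributes to the shared schema and tag its source."""
    aligned = {target: record.get(original) for original, target in mapping.items()}
    aligned["source"] = source
    return aligned


# Two toy records standing in for one paper and one patent.
paper = json.loads(
    '{"title": "Graph Embeddings", "abstract": "We study embeddings.", '
    '"authors": ["A. Author"], "year": 2020}'
)
patent = json.loads(
    '{"invention_title": "Data Indexing System", "abstract": "A system for indexing.", '
    '"inventors": ["B. Inventor"], "grant_year": 2019}'
)

aligned_paper = align(paper, PAPER_FIELDS, "paper")
aligned_patent = align(patent, PATENT_FIELDS, "patent")
```

      <p>After alignment, both records expose the same keys, so downstream steps such as skill extraction can treat papers and patents uniformly while still distinguishing them through the source tag.</p>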
      <p>Given the preliminary nature of this work, we anticipate several future directions, including exploiting
the richness of semantic typing from Wikidata, as well as machine learning methods, to filter skills.
Moreover, large language models (LLMs) will be employed to enhance multiple steps of the sEKG
pipeline. For example, LLMs could automate and improve the precision of skill extraction from patents
and publications by analyzing text at a deeper semantic level, identifying complex relationships, and
understanding context beyond keyword matching. Additionally, LLMs could be used to refine the
knowledge graph population process, providing more accurate and contextually aware mappings
between entities, ultimately improving the overall quality and coverage of the sEKG. Finally, we plan to
use existing human-in-the-loop techniques to refine and enrich user profiles.
</p>
      <p>20 https://ppubs.uspto.gov/pubwebapp/
21 https://credit.niso.org/origins/
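</p>
      <p>[17] K. Möller, T. Heath, S. Handschuh, J. Domingue, Recipes for semantic web dog food: The eswc
and iswc metadata projects, in: Proc. of ISWC’07/ASWC’07, Springer-Verlag, Berlin, Heidelberg,
2007, pp. 802–815.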
[18] A. G. Nuzzolese, A. L. Gentile, V. Presutti, A. Gangemi, Conference linked data: The scholarlydata
project, in: P. Groth, E. Simperl, A. J. G. Gray, M. Sabou, M. Krötzsch, F. Lécué, F. Flöck, Y. Gil
(Eds.), The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan,
October 17-21, 2016, Proceedings, Part II, volume 9982 of Lecture Notes in Computer Science, 2016, pp.
150–158. URL: https://doi.org/10.1007/978-3-319-46547-0_16. doi:10.1007/978-3-319-46547-0_16.
[19] J. Tang, Aminer: Toward understanding big scholar data, in: Proceedings of the ninth ACM
international conference on web search and data mining, 2016, pp. 467–467.
[20] V. Bryl, A. Birukou, K. Eckert, M. Kessler, What is in the proceedings? combining publisher’s and
researcher’s perspectives, in: Proc. of SePublica 2014, Anissaras, Greece, May 25th, 2014, 2014.
[21] D. Lee, J. Kang, P. Mitra, C. L. Giles, B.-W. On, Are your citations clean?, Communications of the
ACM 50 (2007) 33–38.
[22] J. De Smedt, M. le Vrang, A. Papantoniou, ESCO: Towards a semantic web for the European labor
market, in: LDOW@WWW, 2015.
[23] E. Sibarani, S. Scerri, N. Mousavi, S. Auer, Ontology-based skills demand and trend analysis, 2016.
[24] H. Jelodar, Y. Wang, C. Yuan, X. Feng, X. Jiang, Y. Li, L. Zhao, Latent dirichlet allocation (lda)
and topic modeling: models, applications, a survey, Multimedia Tools and Applications 78 (2019)
15169–15211.
[25] S. Momtazi, F. Naumann, Topic modeling for expert finding using latent dirichlet allocation, Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (2013) 346–353.
[26] P. N. Mendes, M. Jakob, A. García-Silva, C. Bizer, Dbpedia spotlight: shedding light on the web of
documents, in: Proceedings of the 7th international conference on semantic systems, ACM, 2011,
pp. 1–8.
[27] L. Zhang, A. Rettinger, X-LiSA: cross-lingual semantic annotation, VLDB 7 (2014) 1693–1696.
[28] A. Moro, A. Raganato, R. Navigli, Entity linking meets word sense disambiguation: a unified
approach, Transactions of the Association for Computational Linguistics 2 (2014) 231–244.
[29] J. Brank, G. Leban, M. Grobelnik, Annotating documents with relevant wikipedia concepts,
Proceedings of SiKDD 472 (2017).
[30] N. F. Noy, D. L. McGuinness, et al., Ontology development 101: A guide to creating your first
ontology, 2001.
[31] B. Shbita, A. L. Gentile, P. Li, C. DeLuca, G.-J. Ren, Understanding Customer Requirements:
An Enterprise Knowledge Graph Approach, in: Proceedings of ESWC 2023, Lecture Notes in
Computer Science, 2023, p. to appear.
[32] A. Alba, A. Coden, A. L. Gentile, D. Gruhl, P. Ristoski, S. Welch, Language agnostic dictionary
extraction, in: N. Nikitina, D. Song, A. Fokoue, P. Haase (Eds.), Proceedings of the ISWC 2017
Posters &amp; Demonstrations and Industry Tracks co-located with 16th International Semantic Web
Conference (ISWC 2017), Vienna, Austria, October 23rd - to - 25th, 2017, volume 1963 of CEUR
Workshop Proceedings, CEUR-WS.org, 2017. URL: https://ceur-ws.org/Vol-1963/paper611.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. d. Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>Dbpedia: A nucleus for a web of open data</article-title>
          ,
          <source>in: The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings</source>
          , Springer,
          <year>2007</year>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pellissier Tanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schaffert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pintscher</surname>
          </string-name>
          ,
          <article-title>From freebase to wikidata: The great migration</article-title>
          ,
          <source>in: Proceedings of the 25th international conference on world wide web</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1419</fpage>
          -
          <lpage>1428</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Abu-Salih</surname>
          </string-name>
          ,
          <article-title>Domain-specific knowledge graphs: A survey</article-title>
          ,
          <source>Journal of Network and Computer Applications</source>
          <volume>185</volume>
          (
          <year>2021</year>
          )
          <fpage>103076</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Zhang,
          <article-title>Knowledge graph-based event embedding framework for financial quantitative investments</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2221</fpage>
          -
          <lpage>2230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I. Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Horng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
          <article-title>Robustly extracting medical knowledge from ehrs: a case study of learning a health knowledge graph</article-title>
          ,
          <source>in: Pacific Symposium on Biocomputing 2020</source>
          , World Scientific,
          <year>2019</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tchechmedjiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fafalios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Boland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gasquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zapilko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <article-title>Claimskg: A knowledge graph of fact-checked claims</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II 18</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sava</surname>
          </string-name>
          , G. Rossiello,
          <string-name>
            <given-names>M. F. M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Yachbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gidh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duckwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nisar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <article-title>Knowledge graph induction enabling recommending and trend analysis: a corporate research community use case</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2022: 21st International Semantic Web Conference, Virtual Event, October 23-27, 2022, Proceedings</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>827</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <article-title>State-of-the-art in open data research: Insights from existing literature and a research agenda</article-title>
          ,
          <source>Journal of Organizational Computing and Electronic Commerce</source>
          <volume>26</volume>
          (
          <year>2016</year>
          )
          <fpage>14</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <article-title>The Microsoft Academic Knowledge Graph: A linked data source with 8 billion triples of scholarly data</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II 18</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Jaradeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Farfar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kismihók</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge</article-title>
          ,
          <source>in: Proceedings of the 10th International Conference on Knowledge Capture</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Priem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Piwowar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Orr</surname>
          </string-name>
          ,
          <article-title>Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts</article-title>
          ,
          <source>arXiv preprint arXiv:2205.01833</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dessí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Reforgiato</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          , E. Motta,
          <article-title>Cs-kg: A large-scale knowledge graph of research entities and claims in computer science</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2022: 21st International Semantic Web Conference, Virtual Event, October 23-27, 2022, Proceedings</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>678</fpage>
          -
          <lpage>696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Angioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>The aida dashboard: a web application for assessing and comparing scientific conferences</article-title>
          ,
          <source>IEEE Access</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>39471</fpage>
          -
          <lpage>39486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovtun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kasprzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>Towards a knowledge graph for science</article-title>
          ,
          <source>in: Proceedings of the 8th international conference on web intelligence, mining and semantics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Torvik</surname>
          </string-name>
          , et al.,
          <article-title>Building a pubmed knowledge graph</article-title>
          ,
          <source>Scientific Data</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>205</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Domingue</surname>
          </string-name>
          ,
          <article-title>Recipes for semantic web dog food: The eswc</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>