<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-supervised Construction of Domain-specific Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lavdim Halilaj</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Richardsen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Dittberner</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Wauro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kim Ngan Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Vietnam Co. Ltd.</institution>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Robert Bosch GmbH</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Current search engines are heavily optimized and excel at retrieving information based on a given set of</p>
      </abstract>
      <kwd-group>
        <kwd>Informed Decisions</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Similarity Search</kwd>
        <kwd>Semantic Search</kwd>
        <kwd>Information Extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the intensive use of digital technologies, the amount of data generated on a daily
basis is increasing at an exponential rate. These data are exploited to support stakeholders
in making informed decisions across various domains and applications. The process of making
informed decisions starts with collecting, distributing, and effectively using the information and
knowledge that exist within an organization. As a result, individuals and organizations make
better decisions, which are crucial for future success. However, the decision-making process
heavily relies on the ability to quickly and efficiently find relevant information. Traditional
solutions struggle with the ambiguity of terminology, an inherent characteristic
of unstructured data. In this context, various automated
Information Extraction (IE) techniques are used to analyse textual descriptions and extract
useful information from them [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Indispensable parts of typical IE pipelines are so-called
dedicated discriminators dealing with specific types of information. Such discriminators
include: 1) entity discovery or named entity recognition; 2) entity linking; 3) relation extraction; and
4) topic extraction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        On the other hand, Knowledge Graphs (KGs) are a means to capture complex relationships
and interconnections between knowledge elements, e.g., entities, their attributes, and relations.
Thus, it is possible to represent and reason about large and diverse collections of information in a structured
way. This makes them a useful tool for representing the domain knowledge of a field of interest
as it is understood by practitioners in that field. KGs are used to support semantic search by
improving the accuracy and relevance of the information found, which in turn enhances the
quality of the decision-making process [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. KGs can be built manually or (semi-)automatically
using different techniques, including those from natural language processing (NLP) or
rule-based methods. However, this task is very complex and heavily relies on human curation and
validation of the facts represented in a KG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In this paper, we present a framework to automatically generate a domain-specific knowledge
graph with minimal supervision. The backbone of the framework is an ontology which encodes
an explicit and unique meaning for each term. This is used to contextualise the unstructured
text retrieved from the original sources and enrich it with additional context and definitions.
Our focus is on improving access to relevant information by: 1) automatically generating
a knowledge graph that turns all cases of a domain-specific scenario described in free text
into unambiguous and machine-readable knowledge; and 2) developing a user-friendly interface
providing interactive features for exploration and visual inspection of graph data. Users can
perform various searches, such as looking for single terms, a combination of terminology
and topics, as well as full-text search in conjunction with learned models. Regardless of the
synonyms used, the search results include detailed technical descriptions of the business cases,
along with an associated list of terms and topics. Further, users are able to navigate and explore
related concepts and information in the knowledge graph, which helps them discover new and
relevant information. Overall, semantic search powered by a knowledge graph provides
the following benefits: 1) describing concepts and relations with semantically enriched axioms
allows for more sophisticated graph exploration and knowledge acquisition; 2) the ability
to select categorical values in advance enables searching over a subset of results and
thus reaching the relevant information faster; and 3) adding new domain terminology and
specifying its synonym relations directly in the graph circumvents the need of
learning-based methods for huge amounts of pre-training data in order to recognize the relations
between similar terms.</p>
      <p>This paper is structured as follows: A motivating scenario is described in Section 2. A set of
identified domain-specific challenges that should be tackled is derived in Section 3. Related
work is outlined in Section 4. Section 5 presents our approach, including the two main pipelines
and their respective components. In Section 6, we provide the concrete implementation details
and illustrate the user interface along with details about the model evaluation. Section 7
concludes the paper and gives an outlook on future directions and extensions of this work.
</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
    </sec>
    <sec id="sec-3">
      <title>Scenario</title>
      <p>For any product or service that is delivered to internal divisions or customers, a
number of issues can occur. On a daily basis, employees have to deal with new business cases
reported by various divisions or customers. These business cases cover issues related
to safety, emissions, electronics, etc. Typically, the first step performed after receiving the
information about a new case is searching for similar ones that have been reported before. For
each issue, a deep investigation of the causes and related factors should be done, as well as of the
affected components or services. In addition, the expert should provide recommendations for
solving the issue and measures that need to be taken into account. The ability to find such cases is
crucial for preventing duplication of work with respect to the investigations that have to
be carried out. Further, it also avoids the risk of missing important aspects to be considered for
similar cases, which might lead to divergent conclusions at the end of the workflow.</p>
      <p>However, the process of searching is tedious, time-consuming, and error-prone. This is due to
a number of factors related to case representation and retrieval. While describing a given
business case, employees use diverse terminology, i.e. some use Term1 and others use
Term2, both pointing to the same functionality or category. This is further exacerbated by
the fact that more than one language is usually used to describe various aspects of given
cases, i.e. a combination of English with the local language. As a consequence, searching based
on keywords only carries the risk of missing similar cases, thus increasing the effort of
repeating the same analysis as well as the risk of reaching different conclusions.</p>
    </sec>
    <sec id="sec-4">
      <title>Domain-Specific</title>
    </sec>
    <sec id="sec-5">
      <title>Challenges</title>
      <p>Data from domain-specific sources pose a number of significant linguistic challenges w.r.t.
diverse terminology and noisy phrases that need to be addressed. In the following, we
emphasize some of the challenges which are crucial for easy information access and retrieval.</p>
      <p>CH1 - Heterogeneity: the structure of the raw data is typically very diverse. This hinders
the ability to exchange and mutually use information across systems or applications.</p>
      <p>CH2 - Multilinguality: information about business cases is usually captured
using a combination of different languages. Therefore, it is of paramount importance
to be able to handle enquiries in multiple languages.</p>
      <p>CH3 - Ambiguity: natural-language user queries can be ambiguous and involve
multiple keywords. Thus, it is crucial to disambiguate the entities mentioned within a query in order
to return the most relevant results.</p>
      <p>CH4 - Multimodality: handling structured data, such as data stored in a knowledge graph,
in addition to textual descriptions, is crucial for accessing information via sophisticated
query mechanisms.</p>
      <p>CH5 - Evolution: domain knowledge can change over time due to domain dynamics,
or as it is refined by experts. Therefore, the ability to quickly reflect such changes in the search
results is very important.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Related Work</title>
      <p>
        The automatic construction of domain-specific KGs has been a subject of investigation for years in
many domains, e.g. the biomedical domain [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ], art and culture [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], and academic literature
and algorithms [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Some of these approaches leverage generic ontologies such as DBpedia and WikiArt or
domain-specific ontologies to model relevant entities and their relations. An ontology-free
approach to automatically construct a knowledge graph for art-historic documents is presented
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The approach is ontology-agnostic and uses open information extraction techniques
to retrieve triples from the sentences. HDSKG [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is an automatic approach that leverages the
content of webpages for discovering domain-specific concepts and their relations. It incorporates
a rule-based dependency parser in combination with machine learning algorithms to find
candidate relations and estimate their domain relevance. To facilitate the process of information
extraction for public KGs, the authors in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed PLUMBER, a framework comprising
different reusable components. These components can perform various tasks, such as entity
recognition and linking, relation extraction, and coreference resolution, and can be combined
into up to 264 distinct pipelines. A framework for building semantic knowledge bases to
support advanced information analysis and intelligent inference for a specific problem domain is
presented in [14]. It includes components and techniques that enable users to develop customized
and reusable pipelines applicable to different domains. A comprehensive survey of
techniques that utilize ontologies for information extraction tasks is provided in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The authors
group the majority of those techniques into two main categories, namely: rule-based - involving
experts defining patterns for denoting tokens, their interrelations and semantics; and
learning-based - which can be supervised, where large annotated text corpora are used to build models;
unsupervised, where clustering techniques are applied to find statistical similarities based on
token co-occurrences; and semi-supervised, where seed examples previously learned from
a small annotated model are used to recursively learn new patterns.
      </p>
      <p>In addition, there exist other approaches aiming to automate KG construction using
declarative rules. These rules are specified via mapping languages such as R2RML (https://www.w3.org/TR/r2rml/) or RML [15],
as well as query languages like SPARQL (https://www.w3.org/TR/sparql11-overview/) and SPIN (https://spinrdf.org/). For instance, AutoMap4OBDA [16] and
BootOX [17] receive an input ontology and generate actual mappings in the R2RML language for a
given relational database. Other solutions [18, 19] address the problem by employing additional
methods, such as heuristic rules, fuzzy search over the KGs, or knowledge graph embeddings,
to match the structure of tabular datasets to the given ontological concepts. The authors in [20]
investigate trends in automatic KG construction with declarative mapping languages. In addition,
they discuss the challenges related to the maintainability, reproducibility, and explainability of KG
construction. In particular, they elaborate on shortcomings present in the declarative approaches,
which follow a linear generation process compared to the iterative process followed
by the traditional approaches based on IE techniques.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Approach</title>
      <p>With the aim of improving the accuracy of the searching process and the relevance of the retrieved
results, we designed a loosely-coupled framework. It allows the incorporation of prior knowledge
formalized in an ontology, which is later utilized to guide both the information extraction and the
searching process. The architecture, as shown in Figure 2, comprises two main
pipelines, namely: 1) Ingestion Pipeline; and 2) Consumption Pipeline.</p>
      <sec id="sec-7-1">
        <title>5.1. Ingestion Pipeline</title>
        <p>This essential pipeline automates the process of turning "raw" data into useful
insights and knowledge using semantic concepts. It consists of three main phases executed in
sequence: 1) Access and Preprocessing; 2) Knowledge Extraction; and 3) Integration and Enrichment.
Each phase comprises a number of components dedicated to performing specific tasks.</p>
        <sec id="sec-7-1-1">
          <title>5.1.1. Access and Preprocessing</title>
          <p>The main task of this phase is to obtain and prepare data for subsequent analysis and processing.
The input data are transformed into a defined structure while parts that could
impair the accuracy and relevance of the final results are discarded, thus addressing challenge CH1.</p>
          <p>Case Retrieval The first step is related to accessing and retrieving information from the
given set of sources, such as databases or APIs. Moreover, the criteria for selecting relevant
data should be defined, along with the access permissions for each data source.</p>
          <p>Cleaning and Translation Here a number of tasks related to cleaning, transforming, and
formatting are performed. This may involve removing duplicates, changing or aligning data
types, as well as normalizing data, which is important for handling challenge CH1. As the quality
of the obtained data is crucial for the end results, this step is important for ensuring the
suitability of the data for the intended purpose. Considering CH2, information is translated
into one single language, i.e. English, while the original text is preserved in respective fields. In
addition, a number of predefined stop-words are removed to reduce the noise in the text with
the aim of enhancing the accuracy of the learning models.</p>
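          <p>The cleaning and stop-word removal described above can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list and the function name are hypothetical, and the actual pipeline relies on a predefined domain list and dedicated libraries rather than this minimal normalization.</p>
```python
import re

# Illustrative stop-word list; the real pipeline uses a predefined domain list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def clean_text(raw: str) -> str:
    """Normalize case, punctuation, and whitespace, then drop stop-words."""
    text = raw.lower()
    # Remove special characters, keeping alphanumerics and spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    tokens = [t for t in text.split(" ") if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("The braking function -- fails  in cold weather!!"))
# braking function fails cold weather
```
          <p>Such a normalized form is what the translation models and the downstream extraction components would then operate on.</p>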
        </sec>
        <sec id="sec-7-1-2">
          <title>5.1.2. Knowledge Extraction</title>
          <p>Automatically extracting information from various unstructured or semi-structured sources,
such as text corpora, images, or audio files, is an important step. With regard to challenge CH4, it
should be possible to identify relevant entities and relationships in a given corpus of text.
This information can later be represented in a knowledge graph encapsulating the underlying
structure of the input data.</p>
          <p>The ontology We built the Business Cases Ontology (BCO) to formally represent the
domain-specific information in the form of classes, attributes, and relationships. It captures essential details
about the business cases via the main concepts: a) Business Case - encapsulating metadata about
the business cases; we created several subclasses to further specify the nature of a particular
case, such as Safety, Exhaust Emission and Charging; b) Function - denoting the type of function
mentioned in the given case; and c) Requirement - describing the requirements relevant to the
case. To enable interlinking with other data sources, several internal and external ontologies
such as Schema.org (https://schema.org) and DBpedia (http://dbpedia.org/ontology#) are reused. The BCO is developed in an iterative manner by
involving technical experts in the phases of requirements collection and domain conceptualization.
Best practices for collaborative development [21] and guidelines [22] such as quality assurance,
role definition, version labeling, and naming conventions are utilized. Currently, BCO contains
29 classes, 15 object properties, 8 datatype properties, and 6 annotation properties.</p>
          <p>Entity Linking Recognising entities mentioned in each business case is essential for enabling
domain experts to quickly reach the information they need. Therefore, this component deals
with Entity Linking, covering both the recognition and the disambiguation of entities within the given
text. Detecting patterns and context in textual descriptions can be achieved via a number of
techniques such as part-of-speech tagging, dependency parsing, or machine learning algorithms.
Moreover, in a domain-specific scenario, a dictionary of the entities can be provided as input
in advance. This dictionary is then used for examining the text w.r.t. mentioned entities. For
example, for each function "f_n" with the ontology identifier bco:Function_n present in a business
case "bc_m", we generate a link {bc_m, bco:hasFunction, bco:Function_n}.</p>
          <p>
            Further, the dictionary may also include prior relations about synonyms, thus enabling
the search engine to "understand" terms with similar meanings which are interchangeable in
certain contexts. As a result, users can search for information using different but related terms,
without having to explicitly specify each possible term in their search query.</p>
          <p>
            Topic Extraction The textual description of business cases can in principle comprise a
mixture of different topics T = {t_1, t_2, . . . , t_n}, where each topic is composed of a number of words
co-occurring together [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. This component is responsible for automatically extracting topics
by identifying the underlying structures and hidden patterns. As a result, co-occurring words
and phrases grouped into topics are used to later on enrich the knowledge graph. Methods
such as pattern matching, clustering, and topic modeling can be utilized to extract the most
relevant words and phrases for each topic. These methods work based on different approaches:
1) unsupervised, typically generative models extracting the topics solely from the correlation of words;
and 2) (semi-)supervised with minimal human intervention, where a set of anchor topics and
their respective words are given as input. Here, we use a semi-supervised approach which
takes as input topics defined in the BCO ontology and then produces correlations of the textual
descriptions with those topics. As a result, each business case is linked to one or multiple topics,
allowing users to explore the knowledge and group the results based on different criteria.
          </p>
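          <p>The anchoring idea behind this semi-supervised step can be illustrated with a toy sketch. Note that the anchor words below are hypothetical, and that checking anchor presence is only a crude stand-in for the actual method, which learns correlations between topics and descriptions (see Section 6.2); the sketch merely shows how expert-given anchors link cases to topics.</p>
```python
# Hypothetical anchor words per topic, as a domain expert might define them.
ANCHORS = {
    "Safety": {"airbag", "brake", "collision"},
    "Exhaust Emission": {"exhaust", "nox", "particulate"},
    "Charging": {"battery", "charger", "plug"},
}

def assign_topics(text: str) -> list:
    """Link a business case to every topic whose anchor words occur in it.
    The real component learns word-topic correlations; this sketch only
    checks anchor presence to convey the idea."""
    tokens = set(text.lower().split())
    return [topic for topic, words in ANCHORS.items() if tokens & words]

print(assign_topics("brake pedal vibration during battery charging"))
# ['Safety', 'Charging']
```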
        </sec>
        <sec id="sec-7-1-3">
          <title>5.1.3. Integration and Enrichment</title>
          <p>This component is responsible for the transformation of the results generated in the previous
steps into a knowledge graph. The objective is to convert the received data into actionable insights
that are later used to filter and refine the relevance of the results and facilitate an informed
decision-making process. Representing the original data in a knowledge graph is realized
according to the BCO ontology.</p>
          <p>The extracted entities and topics are then linked to the concepts in the BCO ontology
and enhanced with semantic definitions of concepts and their relationships. Finally, various
techniques such as inference based on rules or link prediction with machine learning can be
used to complete the graph and make explicit the implicit information.</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>5.2. Persistence</title>
        <p>The persistence component provides the necessary mechanisms to store and manage the
information represented in a knowledge graph over time, as specified in challenge CH5.
A KG represents a set of facts in the form of triples (s, p, o) ∈ E × R × (E ∪ L), where E denotes entities,
L literal values, and R a set of relations between elements in E and E ∪ L. It
also provides access interfaces for later usage as well as the ability to perform more advanced
operations related to querying, indexing, and transaction execution. In addition, this component
has dedicated modules for learning latent representations and capturing syntactic and
semantic properties. Various information modalities such as graphs or text can be retrieved from
the storage and mapped to a high-dimensional vector space, where similar information is
embedded closely. Next, the trained models may be used to provide results based on calculating
the similarity between the given input and the learned vector representations.</p>
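        <p>The final retrieval step, ranking stored items by vector similarity to a query embedding, can be sketched as follows. The vectors and case identifiers are illustrative toy values; in the actual system the embeddings come from the learned models described above.</p>
```python
import math

def cosine(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical learned vectors: a query and two stored business cases.
query = [0.9, 0.1, 0.0]
cases = {"bc_1": [1.0, 0.0, 0.0], "bc_2": [0.0, 1.0, 0.0]}

# Rank stored cases by similarity to the query, most similar first.
ranked = sorted(cases, key=lambda c: cosine(query, cases[c]), reverse=True)
print(ranked)
# ['bc_1', 'bc_2']
```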
      </sec>
      <sec id="sec-7-3">
        <title>5.3. Consumption Pipeline</title>
        <p>After the raw data is fully transformed into a knowledge graph according to the semantic concepts,
it is ready for further usage. This pipeline comprises a number of modules performing
various tasks to enable efficient and effective information consumption, i.e. allowing users
to quickly retrieve relevant results.</p>
        <sec id="sec-7-3-1">
          <title>5.3.1. Backend - Application Logic</title>
          <p>The backend works in conjunction with the frontend and the persistence component. It handles
user requests towards the knowledge graph as well as towards the machine learning models. In turn,
after the queries are executed, this component offers the results for further processing.</p>
          <p>Result Clustering Graph-based and learning-based techniques can be combined to
refine the search space and avoid irrelevant content. The results from this combination of search
techniques should be synthesised to best match the business cases. Depending on the relevance,
various methods are available to rank the results retrieved for the given input. The rankings of
these individual methods can be summarized into an overall ranking, and thus (according to the
ensemble principle) synergies from the specific advantages of the individual methods can be
maximized. The arithmetic mean, median, or harmonic mean (or even more sophisticated
strategies) can be used as the classification criterion for this superordinate ranking.</p>
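          <p>The ensemble ranking described above can be sketched as a small rank-fusion routine. The method names and rank values are illustrative; the point is that per-method rank positions are combined with an exchangeable aggregation function (arithmetic mean, median, or harmonic mean) into one superordinate ordering.</p>
```python
import statistics

# Hypothetical per-method rankings: method -> rank position per case (1 = best).
ranks = {
    "kg_search": {"bc_1": 1, "bc_2": 3, "bc_3": 2},
    "tfidf":     {"bc_1": 2, "bc_2": 1, "bc_3": 3},
    "sbert":     {"bc_1": 1, "bc_2": 2, "bc_3": 3},
}

def aggregate(ranks, combine=statistics.mean):
    """Fuse per-method ranks into one overall ordering; combine may be
    statistics.mean, statistics.median, or statistics.harmonic_mean."""
    cases = next(iter(ranks.values())).keys()
    scores = {c: combine([ranks[m][c] for m in ranks]) for c in cases}
    return sorted(scores, key=scores.get)

print(aggregate(ranks))                     # arithmetic mean
print(aggregate(ranks, statistics.median))  # median
```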
        </sec>
        <sec id="sec-7-3-2">
          <title>5.3.2. Frontend - User Interaction</title>
          <p>Allowing users to interact with the system can be realized in many forms, including submitted
queries, feedback about the relevance of search results, as well as other kinds of graph exploration
and filtering features.</p>
          <p>One of the main objectives is to enable users to quickly and effectively find the relevant
information they are looking for. Therefore, adequate mechanisms for browsing and filtering
are essential to navigate more efficiently over large amounts of information, helping to
narrow down the search space in advance using various categorical or numeric criteria. Further,
users are able to easily provide feedback on the search results.</p>
          <p>The search results contain detailed information on the use cases along with a list of the
associated semantic concepts and their descriptions. This enables users to navigate and explore a
graph-based representation and get a quick overview of similar cases.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Implementation</title>
      <p>We implemented our solution based on the architecture described in the previous section. In
the following, we give technical details about each component.</p>
      <sec id="sec-8-1">
        <title>6.1. Access and Preprocessing</title>
        <p>Accessing the given datasource is realized using the following libraries: requests==2.28.1 and
requests-toolbelt==0.10.0. The relevant columns to be selected from the given source are
predefined in advance. Original data are retrieved in CSV format via a process scheduled to run
once per week. The preprocessing step is performed using pyparsing==3.0.7 and includes data
cleaning tasks like deleting special characters and expressions, fixing white spaces and
punctuation. Further, as there might be a mixture of English, German and Japanese languages, data are
translated to English via specific models, e.g. BERT and libraries such as transformers==4.21.3,
sentence-transformers==2.2.2 and sentencepiece==0.1.96.</p>
      </sec>
      <sec id="sec-8-2">
        <title>6.2. Knowledge Extraction</title>
        <p>The Entity Extraction component is implemented based on the spaCy framework v.2.3.5 using
the rule-based extraction feature. The rule-based matcher engine allows for finding words
and phrases in the form of tokens within the business case corpus. As input, we received a
dictionary with 270 unique functions and their respective descriptions. Next, a link between each
identified function in our scenario and the given business case is established.</p>
        <p>The Topic Extraction component is based on a semi-supervised method, namely Correlation
Explanation (CorEx) [23]. From domain experts, we received a list of 15 topics, each associated
with a different number of words, ranging from a minimum of 3 to a maximum of 30. These
predefined topics are then incorporated as input to CorEx in the form of anchor words. After
the method execution, a total of 3,227 topics are identified for all business cases, with an average
of 3 topics per case.</p>
      </sec>
      <sec id="sec-8-3">
        <title>6.3. Business Case Knowledge Graph</title>
        <p>Data retrieved from the data source is structured in the CSV format. Using the defined pipeline,
this data is then enriched based on the domain ontology. As a result, the Business Case
Knowledge Graph (BCKG) is generated, containing over 22 thousand triples. There are more
than 1000 business cases belonging to one or multiple categories. The BCKG is stored in Stardog
v.7.9.2. In addition, several built-in functions of Stardog for basic string similarity,
such as cosine or Levenshtein distance, are used.</p>
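        <p>For intuition on the second of these metrics, which Stardog evaluates server-side, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. A standard dynamic-programming sketch (not the Stardog implementation) looks as follows:</p>
```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute),
    keeping only the previous matrix row to save memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[len(b)]

print(levenshtein("charging", "charger"))
# 3
```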
      </sec>
      <sec id="sec-8-4">
        <title>6.4. Learning Component</title>
        <p>The knowledge graph-based search is supported by further parallel search methods based only
on the textual entries, such as TF-IDF and S-BERT [24]. In this scenario, the knowledge graph acts as
a refining mechanism to narrow down the search space. The retrieved subset of results is then
used as input to the learning-based methods for further comparison, i.e. TF-IDF and S-BERT.
TF-IDF is a plain bag-of-words method, which exploits the occurrence and the relative frequency of
the terms from the search entry. For this reason, it is able to deal with the respective domain-specific
terminology, but has the disadvantage that different synonyms in the search text and the business case
text cannot be related. In order to tackle this, we retrieve the synonyms from the BCKG, so that the
user has the possibility to incorporate them into the search query. The second assisting method, the
text-based similarity comparison via S-BERT, is characterized by the fact that the meaning of
sentences or text sections is encoded via embedding vectors, where the similarity relationship
between synonyms can be recognized and ambiguities in the meaning can be resolved from
the textual context. The content of texts is not reduced to the presence of terms, but is derived
from the word combination under consideration of the structure and grammar of the sentences.
However, this capability of similarity decoding is reduced to plain word comparison or
comparison of tokens if the used terms are not represented in the underlying language model.
The S-BERT similarity comparison is integrated into an overarching aggregation method that
converts the similarity relationships between individual sentences into similarity relationships
between larger text sections.</p>
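        <p>The TF-IDF weighting just described can be sketched from first principles. The two toy documents and the smoothed IDF formula below are illustrative choices, not the exact weighting scheme of the deployed component, but they show how term counts and corpus-wide document frequencies combine into per-term weights.</p>
```python
import math
from collections import Counter

# Hypothetical miniature corpus of business case texts.
docs = {
    "bc_1": "brake pedal vibration brake noise",
    "bc_2": "battery charging interrupted",
}

def tfidf(text: str, corpus: dict) -> dict:
    """Weight each term of a search entry by its frequency in the entry
    times a smoothed inverse document frequency over the corpus."""
    tf = Counter(text.split())
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus.values() if term in d.split())
        weights[term] = count * math.log((1 + n_docs) / (1 + df))
    return weights

query_vec = tfidf("brake noise", docs)
print(sorted(query_vec))
# ['brake', 'noise']
```
        <p>Comparing such weight vectors with a cosine measure then yields the ranking; terms absent from the corpus receive the highest IDF, which is exactly why unrelated synonyms cannot be matched by this method alone.</p>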
        <sec id="sec-8-4-1">
          <title>6.4.1. Aggregation Methods</title>
          <p>Different methods are developed to derive indicators for the quality of the obtained search
results. For this aim, two main sets are defined:
• the base set B - represents any business case from the database, comprising the text from the
different attributes (e.g. case description, technical description, case conclusion, etc.); and
• the reference set R - represents one specific business case, and contains a textual summary
of this case, which is used as the search entry (e.g. the separately stored attribute technical
assessment is used for this purpose).</p>
          <p>The expectation is that a business case in B related to the chosen summary from R should
be located in the top ranks (i.e. with a low position index) of the search results. Further, for topic
clusters of existing business cases, their individual text descriptions are used as search input.
Here, we investigate to what extent other business cases of the same cluster are represented with
a low position index in the search results. Based on these indicators, the subsequently introduced
aggregation methods are compared in order to decide on the most suitable ones.</p>
        </sec>
        <sec id="sec-8-4-2">
          <title>Definition:</title>
<p>The set X (representing either R or B) is a set of sentences, a.k.a. text units, of the respective
text parts (see above) of the related business case. The number of text units in the set X
is denoted by |X|, v(i) is the embedding vector of the i-th text unit, l_i is the
corresponding unit length in words, and cos(u, v) is the cosine of the intermediate angle
between two vectors u and v. Figure 4 illustrates each aggregation method, defined as follows:</p>
          <p>- SB1: Cosine similarity of the average embedding vectors (averaged separately over
all sentences of R and B, respectively):
S1(R, B) = cos( v̄(R), v̄(B) ), with v̄(X) = (1/|X|) Σ_{i∈X} v(i)   (1)</p>
          <p>- SB2: Average cosine similarity of all pairs of individual sentences from the sets R and B:
S2(R, B) = (1/(|R||B|)) Σ_{i∈R} Σ_{j∈B} cos( v(i), v(j) )   (2)</p>
          <p>- SB3: Maximum cosine similarity for each sentence from R with respect to the sentences from B
(averaged over the sentences of R):
S3(R, B) = (1/|R|) Σ_{i∈R} max_{j∈B} cos( v(i), v(j) )   (3)</p>
          <p>- SB4: Maximum cosine similarity for each sentence from B with respect to the sentences from R
(averaged over the sentences of B):
S4(R, B) = (1/|B|) Σ_{j∈B} max_{i∈R} cos( v(i), v(j) )   (4)</p>
          <p>- SB5: Arithmetic average of the similarities SB3 and SB4:
S5(R, B) = ( S3(R, B) + S4(R, B) ) / 2   (5)</p>
          <p>(Figure 5: Position indexes, averaged over a larger number of cases, for different S-BERT models
(all-mpnet-base-v2, paraphrase-mpnet-base-v2, bert-base-nli-tokens, all-MiniLM-L6-v1, all-MiniLM-L6-v2,
all-MiniLM-L12-v1, all-MiniLM-L12-v2 and all-roberta-large-v1) and different aggregation methods. The lower the
position index, the more reliable the detection of the database cases related to the search texts.)</p>
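          <p>The five aggregation methods can be sketched in plain Python; the embedding vectors below are toy values for illustration only (real S-BERT embeddings have several hundred dimensions):</p>

```python
import math

def cos(u, v):
    # cosine of the intermediate angle between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def avg(X):
    # component-wise average embedding vector over the sentences of a set
    return [sum(col) / len(X) for col in zip(*X)]

def sb1(R, B):  # SB1: cosine of the average embedding vectors
    return cos(avg(R), avg(B))

def sb2(R, B):  # SB2: average cosine over all sentence pairs
    return sum(cos(r, b) for r in R for b in B) / (len(R) * len(B))

def sb3(R, B):  # SB3: best match in B per sentence of R, averaged over R
    return sum(max(cos(r, b) for b in B) for r in R) / len(R)

def sb4(R, B):  # SB4: best match in R per sentence of B, averaged over B
    return sum(max(cos(r, b) for r in R) for b in B) / len(B)

def sb5(R, B):  # SB5: arithmetic mean of SB3 and SB4
    return (sb3(R, B) + sb4(R, B)) / 2

# toy embeddings: two reference sentences (R), three base sentences (B)
R = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
B = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
```

          <p>Note that SB3 and SB4 are not symmetric in R and B, which is precisely why SB5 averages the two directions.</p>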
        </sec>
        <sec id="sec-8-4-3">
          <title>6.4.2. Model Selection</title>
<p>Apart from the aggregation method, the selected S-BERT model also has a significant impact
on the quality of the obtained results. Hence, different pretrained S-BERT models are
investigated based on these indicators. The preselection (see Figure 5) of these models is based
on the performance scores for sentence embeddings and semantic search6.
6https://www.sbert.net/docs/pretrained_models.html
For the comparison, a selection of reference texts in R is used as search input and the received
cases from B are sorted according to their respective similarity. The obtained position indexes
correspond to the ranking of the business cases in B related to the respective reference text
within the search results (averaged over all selected reference texts). This procedure is
repeated for all preselected S-BERT models jointly with the aggregation methods described
above. As can be seen from Figure 5, the lowest position indexes result for the model
all-MiniLM-L6-v2 together with the aggregation method SB5.</p>
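          <p>The position-index indicator used for this comparison can be sketched as follows; the case identifiers and similarity scores are hypothetical placeholders:</p>

```python
def position_index(scores, related_case):
    # rank case ids by descending similarity; return the rank (0 = top)
    # of the database case known to be related to the reference text
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(related_case)

def mean_position_index(runs):
    # average the position index over all selected reference texts
    return sum(position_index(scores, target) for scores, target in runs) / len(runs)

# hypothetical similarity scores for two reference texts and their related cases
runs = [
    ({"case-17": 0.91, "case-08": 0.42, "case-23": 0.77}, "case-17"),
    ({"case-17": 0.35, "case-08": 0.81, "case-23": 0.64}, "case-23"),
]
```

          <p>The lower this mean index, the more reliably the related cases appear at the top of the search results, which is the criterion used to compare models and aggregation methods.</p>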
<p>The evaluation of the quality indexes also revealed that, especially for very short
search text entries (consisting of only one sentence), the aggregation method SB3 can perform
even better than method SB5. Additionally, it should be mentioned here that intended search
results may be based on more abstract aspects, which are often very difficult to capture with
a plain text-based similarity comparison. More advanced investigations to improve the search
results in this respect are ongoing.
</p>
        </sec>
        <sec id="sec-8-4-3-backend">
          <title>6.4.3. Backend</title>
          <p>The backend is implemented using Python v3.9 and various libraries dedicated to interacting
with the knowledge graph as well as handling the application logic and processing.
Communication with the triple store is realized using SPARQLWrapper v2.0, while the
application logic is implemented using rdflib v5.0 and pandas v1.2.5.</p>
        </sec>
        <sec id="sec-8-4-4-frontend">
          <title>6.4.4. Frontend</title>
          <p>The frontend is implemented using Angular 9.1.13, vis-data 7.1.4 and vis-network 9.1.2. It
offers various user-friendly forms for interaction and for displaying the results. The user can select
between different search methods in combination with various criteria and the operators
between them, i.e. OR, AND. These search methods, depicted in Figure 6a, are as follows:
Graph-based - uses the selected categories, the operators, and the given text. These are then
posted to the knowledge graph platform in the form of queries with the selected values and the text
to calculate the similarity with the other cases. A further advanced feature enables leveraging
the subclasses defined in the ontology in combination with Boolean operators, thereby merging
categories and tokens with OR or NOT to allow alternatives or to exclude certain terms. One
particular extension is the inclusion of synonyms or special categories of business cases.
Learning-based - first, the categories from the knowledge graph are used to refine or narrow
down the search space. Then the subset of results is sent to the chosen method, either TF-IDF or
a BERT variant as described above, where the similarity calculation is performed.</p>
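          <p>The graph-based method can be illustrated with a sketch that assembles such a category-restricted query; all prefixes, class and property names below are hypothetical placeholders, not the actual ontology vocabulary:</p>

```python
def build_query(categories, exclude=None, limit=50):
    # combine category restrictions: OR becomes a SPARQL VALUES clause,
    # NOT becomes a FILTER NOT EXISTS block excluding a category
    values = " ".join(f"bc:{c}" for c in categories)
    query = f"""
PREFIX bc: <http://example.org/business-case#>
SELECT ?case ?text WHERE {{
  ?case a bc:BusinessCase ;
        bc:hasCategory ?cat ;
        bc:caseDescription ?text .
  VALUES ?cat {{ {values} }}"""
    if exclude:
        query += f"\n  FILTER NOT EXISTS {{ ?case bc:hasCategory bc:{exclude} }}"
    query += f"\n}} LIMIT {limit}"
    return query

# OR over two categories, while excluding a third one
q = build_query(["Logistics", "Manufacturing"], exclude="Obsolete")
```

          <p>In the backend, such a query string would then be posted to the triple store, e.g. via SPARQLWrapper's setQuery and setReturnFormat calls.</p>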
<p>Moreover, users may post simultaneous queries against all search methods, i.e. graph-based,
TF-IDF, or BERT. Therefore, as shown in Figure 6b, we implemented five different techniques for
ranking the retrieved results: 1) First Rank; 2) Median Rank; 3) Last Rank; 4) Harmonic
Mean; and 5) Arithmetic Mean. The results coming from the selected methods are finally ordered
according to their respective similarity score and the selected technique.</p>
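          <p>The five rank-ensemble techniques can be sketched as follows; the per-method ranks are hypothetical and 1-based, so that the harmonic mean is defined:</p>

```python
import statistics

def combine(ranks, technique):
    # ranks: 1-based positions of one case in each method's result list
    return {
        "first": min,                          # First Rank: best rank wins
        "median": statistics.median,           # Median Rank: robust middle rank
        "last": max,                           # Last Rank: worst rank wins
        "harmonic": statistics.harmonic_mean,  # Harmonic Mean: favors good ranks
        "arithmetic": statistics.mean,         # Arithmetic Mean: plain average
    }[technique](ranks)

# hypothetical ranks of two cases from the graph-based, TF-IDF and BERT methods
case_ranks = {"case-17": [1, 4, 2], "case-23": [3, 1, 5]}
# final ordering for the chosen technique (ascending combined rank)
final = sorted(case_ranks, key=lambda c: combine(case_ranks[c], "median"))
```

          <p>The harmonic mean rewards a case that any single method places near the top, while the last-rank technique only promotes cases that all methods agree on.</p>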
<p>Figure 6: (a) Various searching methods; (b) Various ensemble methods.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>7. Conclusions</title>
<p>This article presents a framework to automatically construct a KG for a given industrial
domain. The input data are provided as textual descriptions of business cases. The approach
comprises a number of components to support information extraction, such as entity and topic
extraction. Next, the extracted entities are linked with the specific concepts of the developed ontology. At
the end of the pipeline, the information is transformed into a knowledge graph. Further, in order to
support users in accessing relevant information, we built a dedicated application. It contains
two main components, namely the backend and the frontend. Users are able to easily use various
methods for searching and reaching the relevant results. They can combine different criteria
to restrict the search scope against the knowledge graph. Then, on the subset of the returned
results, it is possible to apply different methods such as cosine similarity, TF-IDF, or BERT. Further,
it is also possible to leverage the synonyms which are manually defined by domain experts.
In summary, using a knowledge graph to power semantic search provides several advantages,
such as: 1) semantic enrichment of concepts and relationships enhances knowledge exploration
and acquisition; 2) pre-selection of categorical values enables more efficient search and quick
access to relevant information; and 3) incorporating new domain terminology a priori along with
its synonyms helps tackle the need for vast amounts of pre-training data to recognize
relationships between similar terms.</p>
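      <p>The synonym handling mentioned above amounts to simple query expansion; the synonym table below is a hypothetical stand-in for the expert-defined synonyms stored in the knowledge graph:</p>

```python
# hypothetical expert-defined synonyms; in practice maintained in the KG
SYNONYMS = {
    "pump": ["fluid pump", "hydraulic pump"],
    "failure": ["fault", "defect"],
}

def expand_query(tokens):
    # append the known synonyms of each token to the search query,
    # so matches on alternative terminology are not missed
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

query = expand_query(["pump", "failure", "bench"])
```

      <p>The expanded token list is then matched against the case texts, so that a case mentioning only "hydraulic pump" is still retrieved for the query term "pump".</p>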
<p>The ambiguity and imprecision present in natural language text make the automatic
construction of a KG a very challenging task. Therefore, our future direction is to exploit the
power of large language models for extracting pieces of information in the form of subgraphs
before integrating them into the domain knowledge graph. Further, we plan to extend
our solution with another type of similarity based on graph topology and structure. As
the source dataset is currently rather small, the ingestion pipeline pulls all data once per week and
converts them into a knowledge graph. However, as the size of the dataset is expected to grow over
time, we plan to work on techniques that allow retrieving and processing only the deltas.
We also plan to use event-based mechanisms that would allow putting data into the KG
as they arrive or after each change. As a result, the KG will always reflect the latest state of the
business cases and any synchronization issues will be avoided.</p>
    </sec>
  </body>
  <back>
  </back>
</article>