=Paper= {{Paper |id=Vol-3318/short18 |storemode=property |title=Knowledge Management System with NLP-Assisted Annotations: A Brief Survey and Outlook |pdfUrl=https://ceur-ws.org/Vol-3318/short18.pdf |volume=Vol-3318 |authors=Baihan Lin |dblpUrl=https://dblp.org/rec/conf/cikm/Lin22 }} ==Knowledge Management System with NLP-Assisted Annotations: A Brief Survey and Outlook== https://ceur-ws.org/Vol-3318/short18.pdf
Knowledge Management System with NLP-Assisted
Annotations: A Brief Survey and Outlook
Baihan Lin1,*
1
    Columbia University, New York, NY 10027, USA


                                       Abstract
                                       Knowledge management systems (KMS) are in high demand for industrial researchers, chemical or research enterprises, or
                                       evidence-based decision making. However, existing systems have limitations in categorizing and organizing paper insights or
                                       relationships. Traditional databases are usually disjoint with logging systems, which limit its utility in generating concise,
                                       collated overviews. In this work, we briefly survey existing approaches of this problem space and propose a unified framework
                                       that utilizes relational databases to log hierarchical information to facilitate the research and writing process, or generate useful
                                       knowledge from references or insights from connected concepts. Our framework of bidirectional knowledge management
                                       system (BKMS) enables novel functionalities encompassing improved hierarchical note-taking, AI-assisted brainstorming,
                                       and multi-directional relationships. Potential applications include managing inventories and changes for manufacture or
                                       research enterprises, or generating analytic reports with evidence-based decision making.

                                       Keywords
                                       knowledge management, insight annotation, relational databases, natural language processing, machine learning



CIKM 22: Workshop on Human-In-the-Loop Data Curation, October 21, 2022, Atlanta, GA
* Corresponding author.
Email: baihan.lin@columbia.edu (B. Lin)
URL: https://www.neuroinference.com/ (B. Lin)
ORCID: 0000-0002-7979-5509 (B. Lin)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Knowledge management systems (KMS) are the driving engines of modern-day information technologies (IT). These IT systems store data in parsed ways and retrieve knowledge insights to improve information understanding, team collaboration and process alignment within organizations and groups. As engineering entities in high demand for industrial research, chemical or research enterprises and evidence-based decision making, knowledge management systems are often used by organizations to affect innovation performance and generate accurate metrics on organizational capacity [1], but they can also be user-centric by centering the knowledge base around individual users or customers [2].

Take the application of reference management for academic researchers as an example. KMS are often used by researchers to keep track of papers or subsets of papers [3]. Usually, the research information of different papers or references has meta information that can be filtered and sorted. An example scenario would be: a scientist logs or inputs a particular paper into a system, with each entry containing many pieces of meta information about the paper. These meta information elements can be filtered or sorted (e.g., by year, journal, author, etc.). Each paper might contain multiple concepts or topics, and each topic might contain multiple papers. In some cases, we might want the system to be able to automatically assign topics to some papers based on text data mining. The user can filter the papers by topic. Within each paper, during the reading, the scientist might want to log an insight or note on certain paragraphs. Sometimes the notes can be about multiple papers, and their relationships can be of various types. These notes or insights also have topic tags, which can optionally be automatically curated. The system can also generate useful concepts or knowledge as well as their references to facilitate the research and writing process of the scientist.

We see from this example that papers in academic fields can have multiple, bidirectional relationships. Existing knowledge management systems for organizing research papers in scientific fields or organizing manufacture enterprises use directed acyclic graphs, Bayesian networks, and machine learning [3], which have limitations in categorizing and organizing these multi-faceted insights or relationships. This is because many traditional databases are usually disjoint from logging systems, which limits their utility in generating concise, collated overviews. In this work, we briefly survey existing approaches in the general field of these knowledge management systems, and propose a unified framework as a solution to these challenges. In our framework, we describe a knowledge management system that utilizes relational databases to log hierarchical information with connected concepts.

Back to the example problem of reference management, our KMS would utilize relational databases to log hierarchical information to facilitate the research and writing process, or to help generate useful knowledge from references or insights from connected concepts. This would enable novel functionalities encompassing improved hierarchical note-taking, AI-assisted brainstorming, and multi-directional relationships. For instance, one can generate reports given keywords or topics, collating hierarchical and intra-connected records. With these automatic annotations, the system can enable automatic curation of topic tags using text data mining. Other applications include managing inventories and changes for manufacture or research enterprises or generating analytic reports with evidence-based decision making.

Although we have seen successful system designs in commercial products such as Mendeley and recent community efforts such as Open Research Knowledge Graph (ORKG), we believe that our survey can still bring useful and new insights on the practical considerations at the intersections among machine learning, database management and human-system collaboration. In the following sections, we will first briefly survey the existing knowledge management system approaches, and propose a unified bidirectional KMS (BKMS) framework that utilizes relational databases to log hierarchical information to facilitate research and writing and generate helpful knowledge from references or insights from related concepts. We present a useful and novel system design for this bidirectional information management, formulate a few potential use-cases for this design, address the four-subset system of NLP-assisted annotations, and discuss future design considerations.

2. An Applied Perspective

2.1. Applications

There are different application domains for knowledge management systems with relational databases and insight annotation enabled by machine learning, including but not limited to reference managers for academic researchers, education and research tools, consulting-firm report generators with evidence-based decision making, inventory management for manufacture or research enterprises, organizational tools for industries with high-volume data, and internal auditing tools for customized employee metrics.

2.2. User scenarios

Other than the reference management example in our introduction, we also include two additional applications. The first one is managing inventories and changes for manufacture, chemistry or research enterprises. The inventories or measurements of factories usually involve dependencies and hierarchical interactions. A knowledge management system that uses a relational database instead of disjoint databases with separate logging systems can enable a curation function that offers useful and concise reports on key events or phenomena (like the topics). These are important insights to keep the factories or warehouses safe.

The second user scenario example is evidence-based decision making. In large business entities, critical decisions are usually made with a group of market researchers or consulting firms that come up with various analytic reports. A knowledge management system with AI-assisted insight annotation can provide a fast and evidence-based solution by generating a report (given the keyword or topic as input) which curates from hierarchical and interconnected records. This hierarchical knowledge graph can serve as a useful primer in important decision-making processes and guide the investigators to locate relevant resources.

2.3. Case studies

In this section, we outline three case studies that recent real-world knowledge management systems are likely to adopt to become more interconnected and intelligent.

The concept of Internet of Things (IoT): The IoT advancements consist of a series of disruptive digital technologies, semantic languages, and virtual identities that can increase efficiency and effectiveness in daily life operations through interconnected communications among devices and systems [4]. Beyond these organizational benefits, IoT stimulates the innovation process in various aspects, through fast iterations of knowledge flow and information gathering [5]. In [6], researchers employ structural equation modelling on a sample of 298 Italian firms from different sectors. Their study suggests that interconnected knowledge management systems facilitate the creation of an open and collaborative ecosystem by utilizing the internal and external flows of knowledge and increasing internal knowledge management capacity, which in turn increases innovation capacity.

Reference architecture: In the era of Industry 4.0 [7], smart warehouses are envisioned to host production that contains modular and efficient manufacturing systems and characterizes scenarios in which products control their own manufacturing process. As in our user scenario of warehouse inventory management, an optimal reference architecture would be the key to the warehouse knowledge management system. For instance, [8] describes a pipeline to perform a series of systematic analyses to identify the key concerns and processes and eventually arrive at a potential architecture of smart warehouses. They conduct a case study at a large warehouse in the food industry and illustrate that an introduction of a reference architecture can be effective and practical.

Conversational recommendation systems: A conversational recommendation system (CRS) is a computer system that is able to have a conversation with a human user in order to make recommendations [9]. This is different from traditional recommendation systems, which
do not interact with users. Often used in e-commerce, social media, and entertainment applications, CRS are becoming increasingly popular as they can provide a more personalized and interactive experience for users, but can pose additional challenges in managing different layers of knowledge at different states: the intent of the conversation, the entities matched by the intents, the long-term preferences of the users and similar users, their state-dependent preferences related to the current contexts, and the relationships between different entities, intents and users. One practical example is recommending discussion topics to therapists during psychotherapy in real time, given automatically speech-transcribed dialogue records [10] and helpful visual analytics [11].

3. Bidirectional KMS Framework

Figure 1 outlines our framework of bidirectional knowledge management systems (BKMS) with relational databases and insight annotation powered by natural language processing (NLP). The user interface provides the entry points into our knowledge management systems. Different interfaces introduce different routes, but they all involve a parsing and extraction process to atomize the user inputs into nodes that connect in a small knowledge graph. This graph is then placed into a relational database where its links are preserved. The orange and blue arrows indicate intra- and inter-database data flows. The relational databases include three parts. Some databases in the relational databases are only used for storage. Some are used for analysis and annotations. And some databases are kept to store annotated insights or other downstream analytical artifacts, which provide an additional data flow direction.

Figure 1: A unified framework of a knowledge management system with relational databases and NLP-assisted annotation

4. NLP-Assisted Insight Annotation

As shown in the annotation component of Figure 1, there are several routes by which we can utilize natural language processing to generate and annotate insights within our databases. We will elaborate on the role they play in knowledge management systems and survey modern machine learning methods in each of these routes below.

Semantic similarity: In principle, any sentence or paragraph embedding can help us characterize our documents and inventories of interest. For instance, the Doc2Vec embedding [12] is a popular unsupervised learning model that learns vector representations of sentences and text documents. It improves upon the traditional bag-of-words representation by utilizing a distributed memory that remembers what is missing from the current context. SentenceBERT [13] is another popular option which modifies a pre-trained BERT network by using siamese and triplet network structures to infer semantically meaningful sentence embeddings. With word or sentence embeddings, we can embed the document entries from our relational databases into vectors, and then compute the cosine similarity between the vector at a certain turn and an inventory entry. With that, for each text, we obtain an N-dimensional score for the said property. For instance, the inventory can be written guidelines that evaluate the usefulness of certain documents, say, a list of leadership principles that some companies use to evaluate a candidate’s resume, work report or performance review form. And the relational database could be hosting an employee’s self-reported performance review form. The system can automatically compute a score based on each item of the guidelines and annotate these document entries accordingly. Other applications include evaluating the patient-doctor alignment from an automatically
transcribed psychotherapy session based on a clinical questionnaire inventory, as shown in [14, 15, 16].

Topic modeling: In natural language processing and machine learning, a topic model is a type of statistical graphical model that helps uncover the abstract “topics” that appear in a collection of documents. The topic modeling technique is frequently used in text-mining pipelines to unravel the hidden semantic structures of a text body. This can be very handy in annotating the database entries. For instance, a user scenario could be in a clinical consumer-facing chatbot, where the dialogue between the client and agent is transcribed, and a topic modeling analysis is automatically performed to generate a list of discussed topics and their scores based on semantic similarity, as shown in [17]. Several state-of-the-art neural topic models include the Neural Variational Document Model (NVDM) [18] (an unsupervised text modeling approach based on variational auto-encoder), Gaussian softmax construction (GSM) [19] (an NVDM variant), the Wasserstein-based Topic Model (WTM) [20], and the Embedded Topic Model (ETM) [21], among others.

Text summarization: When the scale of our databases increases, maintaining the interpretability of our knowledge management system becomes more and more challenging. This expanding availability of documents and entries inside the database cannot yield actionable insights without proper aggregation. The field of automatic text summarization deals with this problem by producing a concise and fluent summary while preserving key information content and overall meaning [22]. For instance, we can first group or cluster the database entries (such as paper abstracts, or reading notes as in our reference manager example) by their semantic similarity or inferred topics, and then, within each group, generate a condensed description. A use case would be automatically generating writing outlines or topics based on the available references and reading notes in a paper reference manager. In the active field of text summarization, extraction and abstraction are the two main approaches. The extractive summarization techniques generate summaries by choosing a subset of the sentences in the original text, by computing first an intermediate representation of the text, then a sentence score and finally a subset selection operation onto the original texts [23]. The abstraction approach uses latent semantic analysis, frequency-driven approaches [24] and topic modeling, which we cover above.

Symbolic reasoning: While topic modeling offers interpretable subjects, and text summarization offers interpretable paragraphs, the logical and causal relationships between these insights can be arbitrary. The field of symbolic AI bridges this gap by introducing high-level and human-readable symbolic representations into these practical problems. It can potentially derive logic programming rules and semantic relationships that can be used as actionable knowledge graphs [25]. Recently, there has also been increasing interest in a modern approach called neuro-symbolic AI [26, 27], where the well-founded knowledge representation and reasoning from the symbolic perspective are integrated with deep learning from the statistical perspective. This offers both effective predictive power and necessary explainability for many real-world applications.

5. Practical Considerations

When designing an interconnected and intelligent knowledge management system for a domain-specific application, here are some practical questions to be considered:

    • Database consideration: What are the storage capacities of this technology?
    • User interface: What visual and user interface is preferred by users?
    • Organizational benefits: What specific organizational functionality would this system provide over current systems?
    • Latency and responsiveness: What are the synchronization capacities of this technology across devices?
    • Customization: Can users modify or customize this system to their own preferences?
    • Security: Would this technology allow for secure encryption or storage of higher value data?
    • Collaboration: Would this system allow for collaborative use by multiple stakeholders?
    • Investigation: What kind of insights or investigations do we wish to gain from this system?
    • I/O: Would this system allow import or export from other knowledge management systems?

Other than these practical questions to consider, a more thorough design process would involve market analysis (market size, emerging technologies, policies, challenges, and new trends as in [28]), domain analysis (systematic activity for deriving and storing domain knowledge to support the engineering design process as in [29]), business process modeling (i.e. identifying the lead processes and subprocesses of outgoing products [30]) and architecture design with viewpoints (stakeholder concerns, context diagram, decomposition view, uses view, and deployment view [31, 32]). Sometimes, case studies can also be useful to clarify the problem settings.

Since we are proposing the idea of introducing relational databases and various AI and symbolic techniques in knowledge management systems, there are additional future research challenges in relation to this proposition in terms of the human-system “collaboration” enabled by these systems. Methodologically, the machine learning
engine that powers many human-in-the-loop (HIL) solutions in data curation is reinforcement learning, which has been demonstrated to learn effectively from human interactions with speech- or text-based systems [33]. Operationally, from the human side, we need to encourage people to contribute their knowledge and expertise (e.g. crowdsourcing) by creating an effective user interface that allows people to easily log in, search for and find the information they need. From the system side, we need to ensure that knowledge is effectively captured and stored and consistently updated so that it stays current and accurate, and to manage different types of knowledge such that they are accessible to the right people. Finally, there are also ethical and societal considerations when we use machine learning and AI to encode knowledge related to human biometrics and well-being, as reviewed in [34].

6. Conclusions

In summary, we describe the applied problem of knowledge management systems that host information containing multiple and bidirectional relationships in layers of metadata. We briefly survey the application domains, user scenarios and the existing approaches in the fields, and eventually propose a framework for a knowledge management system with relational databases and NLP-assisted insight annotation. In our framework, a knowledge management system can comprise a user interface to provide input and present output relating to one or more documents or sensors. The system maintains a relational database storing information relating to the one or more documents, and a knowledge parsing unit, in communication with the user interface and the server, can determine at a first time instance the metadata information elements associated with the particular document entry. The databases can then be automatically annotated with NLP techniques such as semantic similarity analysis, topic modeling, text summarization and symbolic reasoning. A knowledge graph can then be learned from these language models to be used as interpretable insights for real-world downstream tasks.

References

[1] B. Lawson, D. Samson, Developing innovation capability in organisations: a dynamic capabilities approach, International journal of innovation management 5 (2001) 377–400.
[2] M. A. Kabir, J. Han, J. Yu, A. Colman, User-centric social context information management: an ontology-based approach and platform, Personal and Ubiquitous Computing 18 (2014) 1061–1083.
[3] Y. M. Yee, C. L. Tan, R. Thurasamy, Back to basics: building a knowledge management system, Strategic Direction (2019).
[4] V. Scuotto, A. Ferraris, S. Bresciani, Internet of things: applications and challenges in smart cities. a case study of ibm smart city projects, Business Process Management Journal (2016).
[5] Y. Malhotra, Knowledge management for e-business performance: advancing information strategy to “internet time”, Information Strategy: The Executive’s Journal 16 (2000) 5–16.
[6] G. Santoro, D. Vrontis, A. Thrassou, L. Dezi, The internet of things: Building a knowledge management system for open innovation and knowledge management capacity, Technological forecasting and social change 136 (2018) 347–354.
[7] H. Lasi, P. Fettke, H.-G. Kemper, T. Feld, M. Hoffmann, Industry 4.0, Business & information systems engineering 6 (2014) 239–242.
[8] M. van Geest, B. Tekinerdogan, C. Catal, Design of a reference architecture for developing smart warehouses in industry 4.0, Computers in industry 124 (2021) 103343.
[9] Y. Sun, Y. Zhang, Conversational recommender system, in: The 41st international acm sigir conference on research & development in information retrieval, 2018, pp. 235–244.
[10] B. Lin, G. Cecchi, D. Bouneffouf, Supervisorbot: Nlp-annotated real-time recommendations of psychotherapy treatment strategies with deep reinforcement learning, arXiv preprint arXiv:2208.13077 (2022).
[11] B. Lin, Voice2alliance: automatic speaker diarization and quality assurance of conversational alignment, in: INTERSPEECH, 2022.
[12] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, PMLR, 2014, pp. 1188–1196.
[13] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, Preprint arXiv:1908.10084 (2019).
[14] B. Lin, G. Cecchi, D. Bouneffouf, Deep annotation of therapeutic working alliance in psychotherapy, Preprint arXiv:2204.05522 (2022).
[15] B. Lin, Personality effect on psychotherapy outcome: A predictive natural language processing framework, arXiv preprint (2022).
[16] B. Lin, G. Cecchi, D. Bouneffouf, Working alliance transformer for psychotherapy dialogue classification, arXiv preprint arXiv:2210.15603 (2022).
[17] B. Lin, D. Bouneffouf, G. Cecchi, R. Tejwani, Neural topic modeling of psychotherapy sessions, Preprint arXiv:2204.10189 (2022).
[18] Y. Miao, L. Yu, P. Blunsom, Neural variational inference for text processing, in: International conference on machine learning, PMLR, 2016, pp. 1727–1736.
[19] Y. Miao, E. Grefenstette, P. Blunsom, Discovering discrete latent topics with neural variational inference, in: International Conference on Machine Learning, PMLR, 2017, pp. 2410–2419.
[20] F. Nan, R. Ding, R. Nallapati, B. Xiang, Topic modeling with wasserstein autoencoders, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6345–6381.
[21] A. B. Dieng, F. J. Ruiz, D. M. Blei, Topic modeling in embedding spaces, Transactions of the Association for Computational Linguistics 8 (2020) 439–453.
[22] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, K. Kochut, Text summarization techniques: a brief survey, Preprint arXiv:1707.02268 (2017).
[23] A. Nenkova, K. McKeown, A survey of text summarization techniques, in: Mining text data, Springer, 2012, pp. 43–76.
[24] T. E. Dunning, Accurate methods for the statistics of surprise and coincidence, Computational linguistics 19 (1993) 61–74.
[25] M. Garnelo, M. Shanahan, Reconciling deep learning with symbolic artificial intelligence: representing objects and relations, Current Opinion in Behavioral Sciences 29 (2019) 17–23.
[26] A. d. Garcez, L. C. Lamb, Neurosymbolic ai: the 3rd wave, Preprint arXiv:2012.05876 (2020).
[27] J. Zhang, B. Chen, L. Zhang, X. Ke, H. Ding, Neural, symbolic and neural-symbolic reasoning on knowledge graphs, AI Open (2021).
[28] G. Giudici, A. Milne, D. Vinogradov, Cryptocurrencies: market analysis and perspectives, Journal of Industrial and Business Economics 47 (2020) 1–18.
[29] Ö. Köksal, B. Tekinerdogan, Feature-driven domain analysis of session layer protocols of internet of things, in: 2017 IEEE International Congress on Internet of Things (ICIOT), IEEE, 2017, pp. 105–112.
[30] M. Weske, Business process modelling foundation, in: Business Process Management, Springer, 2019, pp. 71–122.
[31] P. Clements, D. Garlan, R. Little, R. Nord, J. Stafford, Documenting software architectures: views and beyond, in: 25th International Conference on Software Engineering, 2003. Proceedings., IEEE, 2003, pp. 740–741.
[32] E. Demirli, B. Tekinerdogan, Software language engineering of architectural viewpoints, in: European Conference on Software Architecture, Springer, 2011, pp. 336–343.
[33] B. Lin, Reinforcement learning and bandits for speech and language processing: Tutorial, review and outlook, arXiv preprint arXiv:2210.13623 (2022).
[34] B. Lin, Computational inference in cognitive science: Operational, societal and ethical considerations, arXiv preprint arXiv:2210.13526 (2022).
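
As a rough illustration of the relational storage described in Section 3, the following Python sketch stores atomized nodes and their links in SQLite and queries the links in both directions. The table names (node, link), the column names, the relation labels and the helper functions are all assumptions made for this illustration; they are not taken from the framework in Figure 1.

```python
import sqlite3


def build_store():
    """Create an in-memory relational store (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE node (
            id    INTEGER PRIMARY KEY,
            kind  TEXT NOT NULL,   -- e.g. 'paper', 'note', 'topic'
            label TEXT NOT NULL
        );
        CREATE TABLE link (
            src      INTEGER NOT NULL REFERENCES node(id),
            dst      INTEGER NOT NULL REFERENCES node(id),
            relation TEXT NOT NULL
        );
    """)
    return con


def add_node(con, kind, label):
    """Insert one node and return its row id."""
    cur = con.execute(
        "INSERT INTO node (kind, label) VALUES (?, ?)", (kind, label)
    )
    return cur.lastrowid


def add_link(con, src, dst, relation):
    """Store one link; it is queried in both directions below."""
    con.execute(
        "INSERT INTO link (src, dst, relation) VALUES (?, ?, ?)",
        (src, dst, relation),
    )


def neighbors(con, node_id):
    """Return the labels linked to node_id, following links both ways."""
    rows = con.execute(
        """
        SELECT n.label FROM link l JOIN node n ON n.id = l.dst WHERE l.src = ?
        UNION
        SELECT n.label FROM link l JOIN node n ON n.id = l.src WHERE l.dst = ?
        ORDER BY 1
        """,
        (node_id, node_id),
    ).fetchall()
    return [r[0] for r in rows]


con = build_store()
paper_a = add_node(con, "paper", "Paper A")
paper_b = add_node(con, "paper", "Paper B")
topic = add_node(con, "topic", "topic modeling")
note = add_node(con, "note", "A and B use the same corpus")
add_link(con, paper_a, topic, "has_topic")
add_link(con, paper_b, topic, "has_topic")
add_link(con, note, paper_a, "annotates")
add_link(con, note, paper_b, "annotates")

print(neighbors(con, topic))    # papers reachable from the topic
print(neighbors(con, paper_a))  # topic and note reachable from the paper
```

Keeping a single link table and querying it from both ends is one simple way to make every relationship navigable in either direction without storing each link twice.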
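
The semantic similarity route of Section 4 could be sketched as follows. To keep the example self-contained, a bag-of-words count vector stands in for a learned embedding such as Doc2Vec [12] or SentenceBERT [13]; this is a deliberate simplification, and the inventory items, the example entry and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def annotate(entry, inventory):
    """Return one similarity score per inventory item (an N-dimensional score)."""
    e = embed(entry)
    return [cosine(e, embed(item)) for item in inventory]


# Hypothetical inventory of guideline items and one document entry.
inventory = ["customer focus", "deliver results", "learn and be curious"]
entry = "I deliver results for every customer"
scores = annotate(entry, inventory)

for item, score in zip(inventory, scores):
    print(f"{item}: {score:.3f}")
```

With a real embedding model, only embed would change; the scoring and the per-item annotation would stay the same.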
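
The three steps of extractive summarization outlined in Section 4 (intermediate representation, sentence score, subset selection) might look like this in a minimal form. Word frequency is used as the intermediate representation purely as a simplifying assumption; the function names and the sample text are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, k=1):
    """Pick the k highest-scoring sentences, returned in original order."""
    sentences = split_sentences(text)
    # Step 1: intermediate representation (word frequencies over the text).
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    # Step 2: score each sentence by the mean frequency of its words.
    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Step 3: select the top-k sentences and restore document order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(sentences[i]),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


sample = "Topic models tag notes. Topic models group topic notes. Lunch was late."
print(summarize(sample, k=1))
print(summarize(sample, k=2))
```

In the grouped setting described above, summarize would be called once per cluster of database entries rather than on the whole collection.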