=Paper=
{{Paper
|id=Vol-2741/paper-10
|storemode=property
|title=Extracting and Representing Causal Knowledge of Health Conditions
|pdfUrl=https://ceur-ws.org/Vol-2741/paper-10.pdf
|volume=Vol-2741
|authors=Hong Qing Yu
|dblpUrl=https://dblp.org/rec/conf/sigir/Yu20
}}
==Extracting and Representing Causal Knowledge of Health Conditions==
Hong Qing Yu
University of Bedfordshire, School of Computer Science and Technology, Luton, UK
hongqing.yu@beds.ac.uk

Abstract. Most healthcare and health research organizations now publish their health knowledge on the web through HTML or semantic presentations, e.g. the UK National Health Service website. The HTML content contains valuable information about individual health conditions, while graph knowledge presents the semantics of the words in that content. This paper focuses on combining the two to extract causality knowledge. Understanding causal relations is one of the crucial tasks in building an Artificial Intelligence (AI) enabled healthcare system. Unlike other raw data sources used by AI processes, the causality semantic dataset generated in this paper is believed to be more efficient and transparent for supporting AI tasks. Currently, neural network-based deep learning processes find it hard to explain their prediction outputs, largely because they lack knowledge-based probability analysis. Dynamic probability analysis based on causality modeling is a new research area that not only models knowledge in a machine-understandable way but also creates causal probability relations inside the knowledge. To achieve this, a causal probability generation framework is proposed in this paper that extends current Description Logic (DL), applies a semantic Natural Language Processing (NLP) approach, and calculates runtime causal probabilities according to the given input conditions. The framework can be easily implemented using existing programming standards. The experimental evaluation extracts 383 common disease conditions from the UK NHS (National Health Service) website and automatically links 418 condition terms from the DBpedia dataset.

Keywords: Knowledge Graph · Causality · Health · NLP · AI

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIRDS 2020, 30 July 2020, Xi'an, China (online).

1 Introduction

A large amount of high-quality health condition data is available online, such as the UK National Health Service website and the condition descriptions on Wikipedia. Understanding the causal relations inside this data will be useful to enhance self-healthcare awareness and education. The research problem is how to extract these causal relations automatically and understand the semantics of the data, e.g. of sentences and paragraphs. For example, from the sentence "Pneumonia can be caused by a virus, such as a coronavirus (COVID-19)" we want to extract that Pneumonia is a kind of disease, that the coronavirus condition (COVID-19) is another kind of disease, and that the latter is one of the causes of Pneumonia. Besides, probability is an important aspect of causality due to uncertainty: pneumonia can be caused not only by a coronavirus but also by other bacterial infections. In this paper, a probability-based causality extracting and modeling framework is proposed to address this research problem. The two major novelties of the paper are:

(1) A formal health causality extracting framework is proposed to support causal recognition, knowledge modeling, and runtime probability creation.

(2) The first causal knowledge graph is created containing 383 health conditions from the UK NHS website, with causal links to 418 Wikipedia health terms through DBpedia annotations.
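To make the target representation concrete, the following is a minimal sketch of how such an extracted causal relation could be stored as RDF with rdflib. The namespace, class, and property names here are illustrative assumptions, not the exact CPKB schema defined later in Section 3.2.

```python
# Minimal sketch: one extracted causal relation as RDF (names are illustrative).
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/cpkb/")        # hypothetical CPKB namespace
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("ex", EX)
g.bind("dbr", DBR)

# "Pneumonia can be caused by a virus, such as a coronavirus (COVID-19)"
g.add((EX.Pneumonia, RDF.type, EX.Disease))       # Pneumonia is a kind of disease
g.add((EX.COVID_19, RDF.type, EX.Disease))        # the coronavirus condition is another disease
g.add((EX.COVID_19, EX.causes, EX.Pneumonia))     # the extracted causal relation
g.add((EX.Pneumonia, RDFS.seeAlso, DBR.Pneumonia))  # semantic link to DBpedia

print(g.serialize(format="turtle"))
```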
The rest of the paper is organized into four further sections. Section 2 discusses related work. Section 3 explains the whole framework and each of its steps. Section 4 presents an insight evaluation of the generated causal knowledge graph. Section 5 presents the conclusion.

2 Related Work

Representing health knowledge in a way that machines can easily process is an important research area. The core topics in the field can be categorized into three major groups.

The first group focuses on representing clinical data as knowledge, e.g. Electronic Health Records (EHR). An integration process to build a common data model was proposed by [1, 2], aiming to produce shareable, transportable, and computable clinical data. However, that work only addressed the system architecture level (NoSQL) and the data representation level (RDF); it did not directly address knowledge understanding, especially causal relations. Many different frameworks work in this direction of ontology development and triple populating.

The second category applies state-of-the-art machine learning approaches to existing knowledge graph data to perform prediction or classification tasks. The paper [3] proposed a medical code prediction framework that builds a knowledge graph with NLP and external Wikipedia semantic links to the information source. The prediction results are obtained by applying graph vector encodings to a logistic regression classifier. However, such knowledge prediction approaches lack explanation and traceability; they still cannot tell the causes behind a prediction.

The last direction is to add causal knowledge directly to the data. This type of research can be traced back to the 1980s, when the Neyman-Rubin causal inference theory was published. However, the concepts of causation and association (or correlation) were often mixed up or misunderstood until formal mathematical models were presented by Pearl in [4]. The model computes the joint probability distribution on a directed graph that satisfies the back-door criterion, using do(X = x) rather than a randomly observed x to make a probability prediction on Y from statistical knowledge. In simplified terms, a causal relation should be observed if modifying one property also changes the probability distribution of the other property. Therefore, we can distinguish associational relations from causal relations. Most recently, this idea has been applied on top of the reinforcement learning process by the DeepMind team [5]. At the same time, some work has started to investigate adding probability concepts into knowledge graphs to express knowledge with belief-rating thresholds. Based on this idea, a Probabilistic Description Logic (PDL) was presented in 2017 [8] to deal with subjective uncertainty. PDL extends the TBox and ABox definitions of classic Description Logic (DL) with probabilistic threshold notations. However, the probabilities need to be defined at design time or from current knowledge and cannot be tuned dynamically. In addition, it does not model causal relations at all; they are simply replaced by probabilities.
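For reference (this is standard background from [4], not a contribution of this paper), the back-door adjustment behind the do(X = x) notation above can be written, for a covariate set Z that satisfies the back-door criterion, as:

```latex
P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)
```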
3 Causality Knowledge Extracting and Modelling

Overall, the causality extracting framework contains four major approaches, as shown in Fig. 1. A CNN algorithm is applied to identify the sentences that contain causal relations. A composition of NLP and semantic annotation processes is developed to generate semantic word tokens. A causality description logic is introduced to guide causality knowledge graph generation by lifting the semantic word tokens. Finally, when certain input values are given, a runtime probability knowledge graph is created and its probabilities are calculated accordingly.

Fig. 1. Four approaches of the framework

3.1 Causality recognition

Two methods have been applied in this approach. The first is to assume directly that certain sections of the web content contain causality knowledge, for example the symptom and causes sections, which can be defined based on the research interests. The other method is to build a recognition AI model that can identify sentences containing causality statement(s). Two recent research results show that self-attention deep neural networks can achieve more than 70% accuracy on this task [6, 7]. However, their scenarios are more complex, detecting and categorising multiple causal-effect classes, and these algorithms are too expensive in terms of computing resources and time. Our task is mainly to tell whether a sentence contains causality, which is a binary question, and a computationally cheap solution is also a requirement in our scenario. To achieve this, five different machine learning algorithms were evaluated on a training dataset composed of two datasets from the previous research work presented in [9]. Table 1 shows that the CNN model provided the best result for recognising causal sentences.

Table 1. AI algorithms evaluation

Algorithm            | Total accuracy | F1 score | CPU/GPU | Library
Random Forest        | 0.79           | 0.79     | CPU     | Scikit-learn
SVM                  | 0.81           | 0.81     | CPU     | Scikit-learn
Logistic Regression  | 0.81           | 0.81     | CPU     | Scikit-learn
MNB                  | 0.81           | 0.81     | CPU     | Scikit-learn
LSTM                 | 0.88           | 0.86     | GPU     | Keras 2.0
CNN                  | 0.98           | 0.90     | GPU     | Keras 2.0

3.2 Causality knowledge modelling

The DL-ιt expression is refined to define a Causal Probability Knowledge Base (CPKB) that has four elements, as equation (1) represents:

CPKB = {T, A, Φ, P(φ)}    (1)

where T is the T-box ontology (terminology structure), A is the A-box (instance assertions), and Φ is the root causal function, which is the major extension to traditional DL-ιt. Φ represents the causal relation that can hold between any concepts defined inside T; a subclass of Φ can be defined to indicate a specific causal relation between two concepts. P(φ) gives the probability values of causal relations between two instances at the A-box level and, importantly, only at runtime: a set of runtime P(φ) values is calculated based on the input observations.

For the health condition application scenario, Fig. 2 presents the defined T-box and Φ in an OWL schema, which includes twelve concepts, ten causal relations (Φ), and three normal relations.

Fig. 2. Health condition CPKB definitions

3.3 Causality extraction and lifting process

The causality extraction process has two components:

(1) NLP-based causal keyword tokenization captures the keywords that may have causal relations in the causality texts identified in the previous steps. The tokenization follows the classic NLP steps of segmentation, word tokenization, stop-word removal, and stemming, eventually yielding the noun keywords or phrases. For example, the words pneumonia, virus, and coronavirus will be captured from the sentence "Pneumonia can be caused by a virus, such as a coronavirus (COVID-19)".
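A minimal sketch of this tokenization step is shown below; the paper does not name a specific NLP library, so NLTK is assumed here.

```python
# Minimal sketch of the causal-keyword tokenization step (library choice assumed).
# Requires: nltk.download("punkt"), nltk.download("stopwords"),
#           nltk.download("averaged_perceptron_tagger")
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def noun_keywords(text):
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    keywords = []
    for sentence in nltk.sent_tokenize(text):            # segmentation
        tokens = nltk.word_tokenize(sentence)            # word tokenization
        tokens = [t for t in tokens
                  if t.isalpha() and t.lower() not in stop]  # stop-word removal
        for word, tag in nltk.pos_tag(tokens):            # keep noun keywords
            if tag.startswith("NN"):
                keywords.append((word, stemmer.stem(word)))  # stemming
    return keywords

sentence = "Pneumonia can be caused by a virus, such as a coronavirus (COVID-19)."
print(noun_keywords(sentence))
# Expected to include nouns such as 'Pneumonia', 'virus', and 'coronavirus'.
```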
(2) Semantic lifting calls a semantic annotation API (DBpedia Spotlight) to classify the keywords and phrases into the different terms defined in the CPKB ontology, based on rdf:type and other related predicates described in the DBpedia dataset. For example, the word 'Lung' is typed as a DBpedia anatomical structure class by the rdf:type statement in the Lung RDF data.

Based on these two components, we can extract causality from given sentences or paragraphs. In the end, a knowledge graph is generated for each crawled health condition through this CPKB-based semantic populating. Currently, knowledge graphs for 383 health conditions are integrated from the UK NHS website, with additional causal semantic links to 418 Wikipedia health terms through the DBpedia dataset.

3.4 Causality-based runtime probability knowledge graph

With the health condition causality knowledge in hand, the runtime probability knowledge graph can be dynamically generated based on the number of incoming links to each of the inputs. For example, the observed input conditions for a boy (child and male) are:

Symptoms: cough, breathing, fever, heartbeat, chest pain, fatigue, shivering, infection.
Unwell body position: lung.

With these input conditions, Fig. 3 (a partial graph of the actual graph, shown as an example) presents a runtime probability distribution among the relevant causal relations. For instance, the disease Pneumonia has causal probabilities of around 0.0054 and 0.018 for the Heartbeat and Cough problems respectively.

Fig. 3. Runtime probability knowledge graph example

4 Insight of Causality Knowledge Graph

After crawling health conditions throughout the NHS webpages and building semantic causal relations with Wikipedia definitions and DBpedia terms, we generated a causality knowledge graph that contains 801 health conditions, 1078 symptoms/physiologies, 377 treatments (including drugs), 8 categorized habits, 66 different human groups, and 113 species.

Fig. 4 shows the 25 symptoms or physiological reflections that have the most connections with other health conditions. Interestingly, Schizophrenia, a mental health condition, can be developed from 264 diseases. The other noticeable information is that many diseases may have sequelae and contribute to rare diseases. The figure also indicates that diabetes is one of the most common symptoms of other diseases.

Fig. 4. Top 25 symptoms or physiological reflections

Based on the causal relations, eight habit or lifestyle-related scenarios can contribute to developing serious health problems. The top one is smoking-related habits, which are the most dangerous and connect to more than 100 diseases; the other noticeable one is overeating. The causal reasoning results also show that Autumn and Winter have more connections to diseases than the other seasons, which reflects common sense.

Through the causal relations, condition chains are discovered, for example Rheumatoid arthritis → Psoriasis → Paget's disease of the nipple → Breast cancer → Weight loss. So far, 3683 5-length chains, 3847 4-length chains, and 111186 3-length chains have been discovered. All these condition chains are hidden knowledge that is not identified in the original descriptions on the webpages.

Besides, the health conditions from the NHS are clustered into 42 groups by applying the unsupervised K-means clustering algorithm and a cluster optimization process. For example, a list of observations ['headache', 'influenza', 'fever', 'throat', 'children'] is mostly related to the health conditions in Cluster 0, which contains 12 diseases: ['Bornholm-disease', 'Common-cold', 'Diphtheria', 'Chickenpox', 'Flu', 'Hand-foot-mouth-disease', 'Polio', 'Q-fever', 'Roseola', 'Rubella', 'Slapped-cheek-syndrome', 'Tonsillitis'].
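The paper does not specify the feature representation used for clustering; the sketch below shows one plausible setup, assuming each condition is represented by a TF-IDF bag of its symptom terms and using scikit-learn's KMeans. The condition data and the number of clusters here are illustrative, not the actual 383-condition dataset or the reported 42-cluster optimum.

```python
# Minimal sketch: cluster conditions by symptom terms, then look up the closest
# cluster for a new observation list (feature choice and data are assumptions).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

conditions = {                          # hypothetical condition -> symptom terms
    "Flu": "fever headache cough fatigue children",
    "Common-cold": "cough sneezing sore throat headache",
    "Pneumonia": "cough fever chest pain breathing",
}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(conditions.values())

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Assign a new observation list to the closest cluster and list its members.
observations = "headache influenza fever throat children"
cluster = kmeans.predict(vectorizer.transform([observations]))[0]
members = [name for name, label in zip(conditions, kmeans.labels_) if label == cluster]
print(cluster, members)
```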
5 Conclusion and Future Work

A causality-focused knowledge graph generation approach is introduced in this paper. The major purposes of the work are to extract the causal relations inside health descriptive data on the Web and to create a probability knowledge space at runtime to support further AI tasks. The evaluations of the causal probability knowledge graph have already shown some interesting conclusions and the ability to enhance the explanation capabilities of prediction and clustering approaches. The implementation code and the dataset are available at [10].

There are two limitations in the current state of the work. The first is that some compound keywords, e.g. 'body pain', are not captured by the classic NLP and semantic annotation processes. The second is that our knowledge graph is not yet fully connected to existing external health knowledge datasets, e.g. UMLS [11]. In the short term, our research will focus on addressing these limitations. The long-term future research has a couple of directions: first, to develop an efficient embedding method that can capture causal relation features and apply well-studied machine learning algorithms, especially deep learning architectures; second, to investigate graph-based learning algorithms that can work directly on the graph data and utilize the reasoning power of the graph, the causal relations, and the runtime probability definitions.

References

1. Overhage, J. M., Ryan, P. B., Reich, C. G., Hartzema, A. G., Stang, P. E.: Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association (JAMIA), 19(1), 54-60, 2012. https://doi.org/10.1136/amiajnl-2011-000376
2. Rosenbloom, S. T., Carroll, R. J., Warner, J. L., Matheny, M. E., Denny, J. C.: Representing Knowledge Consistently Across Health Systems. Yearbook of Medical Informatics, 26(1), 139-147, 2017. https://doi.org/10.15265/IY-2017-018
3. Bai, T., Vucetic, S.: Improving Medical Code Prediction from Clinical Text via Incorporating Online Knowledge Sources. In: The World Wide Web Conference (WWW '19), Ling Liu and Ryen White (Eds.). ACM, New York, NY, USA, 72-82, 2019. https://doi.org/10.1145/3308558.3313485
4. Pearl, J.: An Introduction to Causal Inference. The International Journal of Biostatistics, 6(2), 2010.
5. Dasgupta, I., Wang, J., et al.: Causal Reasoning from Meta-reinforcement Learning. arXiv preprint arXiv:1901.08162, 2019.
6. Li, Z., Li, Q., Zou, X., Ren, J.: Causality Extraction based on Self-Attentive BiLSTM-CRF with Transferred Embeddings. arXiv preprint arXiv:1904.07629, 2019.
7. Dasgupta, T., Saha, R., Dey, L., Naskar, A.: Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks, 306-316, 2018. https://doi.org/10.18653/v1/W18-5035
8. Gutierrez-Basulto, V., Jung, J. C., Lutz, C.: Probabilistic Description Logics for Subjective Uncertainty. Journal of Artificial Intelligence Research 58, 1-66, 2017.
9. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: the 90% solution. In: Proceedings of NAACL, Companion Volume: Short Papers, 57-60, ACL, 2006.
10. NHS causal knowledge graph with evaluation and clustering, https://github.com/semanticmachinelearning/nhscausalknolwedgegraph. Last accessed 21 June 2020.
11. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(Database issue), D267-D270, 2004.