<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Gutierrez-Basulto, V., Jung, J.C. and Lutz, C.: Probabilistic Description Logics for
Subjective Uncertainty. Journal of Artificial Intelligence Research</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extracting and Representing Causal Knowledge of Health Conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hong Qing Yu</string-name>
          <email>hongqing.yu@beds.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bedfordshire, School of Computer Science and Technology</institution>
          ,
          <addr-line>Luton</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>58</volume>
      <issue>2017</issue>
      <fpage>57</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>Most healthcare and health research organizations now publish their health knowledge on the web as HTML or semantic representations, e.g. the UK National Health Service website. The HTML content contains valuable information about individual health conditions, while graph knowledge captures the semantics of the words in that content. This paper combines the two to extract causality knowledge. Understanding causal relations is a crucial task in building an Artificial Intelligence (AI) enabled healthcare system. Unlike the raw data sources used by other AI processes, this paper generates a causality semantic dataset, which is believed to be more efficient and transparent for supporting AI tasks. Currently, neural network-based deep learning processes struggle to explain their prediction outputs, largely because they lack knowledge-based probability analysis. Dynamic probability analysis based on causality modeling is a new research area that can both model knowledge in a machine-understandable way and create causal probability relations inside that knowledge. To achieve this, the paper proposes a causal probability generation framework that extends the current Description Logic (DL), applies a semantic Natural Language Processing (NLP) approach, and calculates runtime causal probabilities according to the given input conditions. The framework can be implemented using existing programming standards. The experimental evaluation extracts 383 common disease conditions from the UK National Health Service (NHS) and automatically links them to 418 condition terms from the DBpedia dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph</kwd>
        <kwd>Causality</kwd>
        <kwd>Health</kwd>
        <kwd>NLP</kwd>
        <kwd>AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A large amount of high-quality health condition data is available online, such as the
UK National Health Service (NHS) website and condition descriptions on Wikipedia.
Understanding the causal relations inside this data would help to enhance
self-healthcare awareness and education. The research problem is how to extract
these causal relations automatically and to understand the semantics of the
data, e.g. sentences and paragraphs. For example, from the sentence "Pneumonia can be
caused by a virus, such as a coronavirus (COVID-19)" we want to extract that pneumonia
is a kind of disease and that coronavirus, another kind of disease, is one of the
causes of pneumonia. Probability is also an important
aspect of causality because of uncertainty, e.g. pneumonia can be caused not
only by coronavirus but also by bacterial infections. In this paper, a
probability-based causality extraction and modeling framework is proposed to address this
research problem. The two major novelties of the paper are:</p>
      <p>(1) A formal health causality extracting framework is proposed to support
causal recognition, knowledge modeling, and runtime probability creation.</p>
      <p>(2) The first causal knowledge graph is created, containing 383 health
conditions from the UK NHS website with causal links to 418 Wikipedia health terms
through DBpedia annotations.</p>
      <p>The rest of the paper has four further sections.
Section 2 discusses related work.</p>
      <p>Section 3 explains the whole framework and each of its steps.</p>
      <p>Section 4 presents insights from an evaluation of the generated causal knowledge
graph.</p>
      <p>Section 5 presents the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Representing health knowledge in a form that machines can easily process is
an important research area. The core topics in the field can be categorized into
three major groups.</p>
      <p>
        The first group focuses on representing clinical data, e.g. Electronic
Health Records (EHR), as knowledge. An integration process to build a common data model
was proposed in [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], aiming to produce shareable, transportable, and
computable clinical data. However, that work emphasized only the system
architecture level (NoSQL) and the data representation level (RDF) and did not directly
address knowledge understanding, especially of causal relations. Many different
frameworks have worked in this direction of ontology development and triple
population.
      </p>
      <p>
        The second group applies state-of-the-art machine learning approaches to
existing KG data to perform prediction or classification tasks. The paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
proposed a medical code prediction framework that builds a KG with NLP and
external Wikipedia semantic links to the information source. Predictions are
obtained by feeding graph vector encodings into a logistic regression classification
algorithm. However, these knowledge prediction approaches lack explanations
and tractability. Moreover, they still cannot tell the causes of a given prediction.
      </p>
      <p>
        The last group adds causal knowledge directly to the data. This type
of research can be traced back to the 1980s, when the Neyman-Rubin causal inference
theory was published. However, the concepts of causation and association or
correlation were often mixed up or misunderstood until the formal mathematical
models were presented by Pearl in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The model computes a joint probability
distribution on a directed graph that satisfies the back-door criterion, using
an intervention do(X=x) rather than a randomly observed x to make a probability
prediction on Y based on statistical knowledge. In simplified terms, a causal relation is
observed if, when one property is modified, the probability
distribution of the other property also changes. Therefore, we can distinguish associational
relations from causal relations. Most recently, this idea has been applied on top of
the reinforcement learning process by the DeepMind team [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. At the same time,
some work has started to investigate adding probability concepts to
knowledge graphs to express knowledge with belief rating thresholds. Based on
this idea, a Probabilistic Description Logic (PDL) was presented in 2017 [8] to
deal with subjective uncertainty. The PDL extends the TBox and ABox definitions
of classic Description Logic (DL) with probabilistic threshold notations.
However, the probabilities need to be defined at design time or from current
knowledge and cannot be tuned dynamically. In addition, it does not
model causal relations at all, replacing them with probabilities.
      </p>
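      <p>For reference, the intervention-based prediction attributed to Pearl above is commonly written as the back-door adjustment formula. The sketch below assumes a set of variables Z that satisfies the back-door criterion relative to (X, Y); the symbol Z and the summation index z are introduced here for illustration and do not appear in the text above.</p>
      <disp-formula>
        <tex-math><![CDATA[P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)]]></tex-math>
      </disp-formula>
      <p>The left-hand side is the distribution of Y under an intervention that sets X to x, which generally differs from the observational conditional P(Y = y | X = x) when X and Y share common causes.</p>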
    </sec>
    <sec id="sec-3">
      <title>Causality Knowledge Extracting and Modelling</title>
      <p>Overall, the causality extraction framework contains four major steps, as
shown in Fig. 1. A CNN algorithm is applied to identify the sentences that
contain causal relations. A combination of NLP and semantic annotation
processes generates semantic word tokens. The causality
description logic is introduced to guide the generation of the causality knowledge graph by
lifting the semantic word tokens. Finally, the runtime probability knowledge
graph, with defined probabilities, is created when the probabilities are
calculated for given input values.</p>
      <sec id="sec-3-1">
        <title>Causality recognition</title>
        <p>
          Two methods have been applied in this step. One is to assume that
certain sections of the web content contain causality knowledge, for
example the symptoms and causes sections, which can be defined according to the
research interest. The other method is to build a recognition AI model that
can identify sentences containing causality statements. Two recent
research results show that self-attention deep neural networks can achieve more
than 70 percent accuracy on this task [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. However, their scenarios are more
complex, as they detect and categorise multiple causal effect classes. In addition, these
algorithms are expensive in terms of computing resources and time. Our
task mainly asks whether a sentence contains causality, which is a binary question, and a
cheap solution is also a requirement in our scenario. To achieve this, five different
machine learning algorithms have been evaluated on a training dataset.
The training dataset is composed of two datasets from the previous research
work presented in [9]. Table 1 shows that the CNN model provided the best result
for recognising causal sentences.
        </p>
        <p>In the causality description logic, T is the TBox ontology (terminology structure),
A is the ABox instance data (assertions), and the root causal function is the major extension
to traditional DL. The causal function represents a causal relation that can hold between
any concepts defined inside T. A subclass of the causal function can be defined to indicate a specific
causal relation between two concepts. The probability P of a causal function gives the probability value of a causal
relation between two instances at the ABox level and, importantly, only at runtime.
The set of runtime probabilities is calculated from the input observations.</p>
        <p>For the health condition application scenario, Fig. 2 presents the TBox and
causal functions defined in OWL schema, comprising twelve concepts, ten causal relations,
and three ordinary relations.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Causality extraction and lifting process</title>
        <p>The causality extraction process has two components:</p>
        <p>(1) NLP-based causal keyword tokenization captures the keywords that
may have causal relations in the causality texts identified in the previous step.
The tokenization follows the classic NLP steps of sentence segmentation, word tokenization,
stop-word removal, and stemming, and finally yields the noun keywords or phrases.
For example, the words pneumonia, virus, and coronavirus are captured
from the sentence "Pneumonia can be caused by a virus, such as a coronavirus
(COVID-19)".</p>
        <p>(2) Semantic lifting calls a semantic annotation API (DBpedia Spotlight) to
classify the keywords and phrases into the terms defined in the CPKB ontology,
based on the rdf:type and other related predicates described in the DBpedia
dataset. For example, the word 'Lung' is an instance of the DBpedia anatomical structure
class, as given by the rdf:type of the lung RDF data.</p>
        <p>Based on these two components, we can extract causality from given
sentences or paragraphs. Finally, we generate a knowledge graph for each
crawled health condition by populating the CPKB-based semantics.
Currently, the knowledge graphs of 383 health conditions from the UK NHS
website are integrated, with additional causal semantic links to 418 Wikipedia health terms
through the DBpedia dataset.</p>
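        <p>The two components above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list is a tiny hypothetical one, stemming and noun filtering are omitted, and the TYPE_LOOKUP dictionary is a hypothetical stand-in for the semantic annotation service, with illustrative class labels rather than the actual CPKB or DBpedia vocabulary.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"a", "an", "the", "can", "be", "by", "such", "as",
              "is", "of", "and", "or", "to", "caused"}

# Hypothetical lookup standing in for the semantic annotation service.
# The class labels are illustrative only.
TYPE_LOOKUP = {
    "pneumonia": "Disease",
    "virus": "Species",
    "coronavirus": "Species",
}


def extract_keywords(sentence):
    """Lower-case the sentence, split it into word tokens, and drop stop words.

    Order of first appearance is kept and duplicates are removed.
    Stemming and part-of-speech filtering are intentionally left out.
    """
    tokens = re.findall(r"[a-z][a-z0-9-]*", sentence.lower())
    keywords = []
    for token in tokens:
        if token not in STOP_WORDS and token not in keywords:
            keywords.append(token)
    return keywords


def lift(keywords, lookup):
    """Attach a concept type to each keyword that the lookup recognises."""
    return {word: lookup[word] for word in keywords if word in lookup}


sentence = "Pneumonia can be caused by a virus, such as a coronavirus (COVID-19)"
keywords = extract_keywords(sentence)
lifted = lift(keywords, TYPE_LOOKUP)
print(keywords)
print(lifted)
```
]]></preformat>
        <p>In this sketch the token covid-19 survives tokenization but is dropped by the lifting step because the hypothetical lookup has no entry for it, which mirrors how unannotated keywords would fall out of the knowledge graph.</p>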
      </sec>
      <sec id="sec-3-3">
        <title>Causality-based runtime probability knowledge graph</title>
        <p>With health condition causality knowledge in hand, the runtime probability
knowledge graph can be generated dynamically based on the number of incoming
links to each of the inputs. For example, the observed input conditions for a boy
(child and male) are:</p>
        <p>Symptoms: cough, breathing, fever, heartbeat, chest pain, fatigue,
shivering, and infection. Unwell body position: lung.</p>
        <p>With these input conditions, Fig. 3 (a partial view of the actual graph,
shown as an example) presents the runtime probability distribution among the relevant causal
relations. For instance, Pneumonia has causal probabilities of around 0.0054 and 0.018
for the Heartbeat and Cough problems, respectively.</p>
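        <p>One possible reading of the runtime calculation is sketched below. It assumes, purely for illustration, that the incoming causal links of each observed input share the probability uniformly, so every link gets one divided by the number of incoming links; the exact formula is not given in the text, and the function name and the toy edges are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def runtime_probabilities(causal_edges, observations):
    """Return {(cause, effect): probability} for every edge into an observed effect.

    Assumption (not taken from the paper): the incoming causal links of an
    observed node share the probability uniformly, i.e. each link gets
    1 / (number of incoming links).
    """
    incoming = defaultdict(list)
    for cause, effect in causal_edges:
        if effect in observations:
            incoming[effect].append(cause)

    probabilities = {}
    for effect, causes in incoming.items():
        for cause in causes:
            probabilities[(cause, effect)] = 1.0 / len(causes)
    return probabilities


# Toy causal edges (cause, effect), invented for illustration only.
edges = [
    ("Pneumonia", "Cough"),
    ("Common cold", "Cough"),
    ("Flu", "Cough"),
    ("Flu", "Fever"),
    ("Pneumonia", "Fever"),
    ("Pneumonia", "Chest pain"),
    ("Asthma", "Wheezing"),
]
observed = {"Cough", "Fever", "Chest pain"}
probs = runtime_probabilities(edges, observed)
for edge, value in sorted(probs.items()):
    print(edge, round(value, 3))
```
]]></preformat>
        <p>Because the probabilities depend on which inputs are observed, they are recomputed for every new set of observations rather than stored in the graph, which is the sense in which they exist only at runtime.</p>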
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Insight of Causality Knowledge Graph</title>
      <p>After crawling health conditions across the NHS webpages and building
semantic causal relations with Wikipedia definitions and DBpedia terms, we
generated a causality knowledge graph that contains 801 health conditions, 1078
symptoms/physiologies, 377 treatments including drugs, 8 categorized habits,
66 different human groups, and 113 species.</p>
      <p>Fig. 4 shows the 25 symptoms or physiological reflections that have the most
connections with other health conditions. Interestingly, Schizophrenia, a
mental health condition, can develop from 264 diseases. Another
notable finding is that many diseases may have sequelae and contribute to rare
diseases. The figure also indicates that diabetes is one of the most common
symptoms of other diseases.</p>
      <p>Based on the causal relations, eight habits or lifestyle-related scenarios can
contribute to developing serious health problems. Smoking-related habits are the
most dangerous and connect to more than 100 diseases. The
other notable one is overeating.</p>
      <p>The causal reasoning result also shows that Autumn and Winter have
more connections to diseases than the other seasons, which reflects common sense.</p>
      <p>Condition chains are also discovered through the causal relations, for example
Rheumatoid arthritis → Psoriasis → Paget's disease of the nipple → Breast cancer
→ Weight loss. So far, 3683 chains of length 5, 3847 chains of length 4, and 111186 chains
of length 3 have been discovered. All these condition chains are hidden knowledge
that is not identified in the original descriptions on the webpages.</p>
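      <p>Chain discovery of this kind can be sketched as a depth-first enumeration of simple paths in the directed causal graph. The function below is a minimal illustration and not the implementation used for the reported counts; it assumes that the length of a chain is its number of nodes (so the example chain above has length 5), and the adjacency dictionary contains only that example chain.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def find_chains(graph, length):
    """Return every simple directed path that contains exactly `length` nodes.

    `graph` maps a node to the list of nodes it causally links to.
    Assumes length is at least 2, since paths start only from nodes
    that have outgoing links.
    """
    chains = []

    def extend(path):
        if len(path) == length:
            chains.append(tuple(path))
            return
        for successor in graph.get(path[-1], ()):
            if successor not in path:  # keep paths simple so cycles terminate
                extend(path + [successor])

    for start in graph:
        extend([start])
    return chains


# Adjacency list holding only the example chain from the text.
causal_graph = {
    "Rheumatoid arthritis": ["Psoriasis"],
    "Psoriasis": ["Paget's disease of the nipple"],
    "Paget's disease of the nipple": ["Breast cancer"],
    "Breast cancer": ["Weight loss"],
}
five_chains = find_chains(causal_graph, 5)
three_chains = find_chains(causal_graph, 3)
print(len(five_chains), len(three_chains))
```
]]></preformat>
      <p>On a graph with hundreds of conditions the number of simple paths grows quickly with the chain length, so a practical run would likely cap the length, as the reported counts for lengths 3 to 5 suggest.</p>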
      <p>In addition, the health conditions from the NHS are clustered into 42 groups by
applying the unsupervised k-means clustering algorithm with a cluster optimization
process. For example, the list of observations ['headache', 'influenza', 'fever', 'throat',
'children'] is most closely related to the health conditions in Cluster 0, which contains
12 diseases: ['Bornholm-disease', 'Common-cold', 'Diphtheria', 'Chickenpox',
'Flu', 'Hand-foot-mouth-disease', 'Polio', 'Q-fever', 'Roseola', 'Rubella',
'Slapped-cheek-syndrome', 'Tonsillitis'].</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>A causality-focused knowledge graph generation approach is introduced in this
paper. The major purposes of the work are to extract causal relations from
health descriptive data on the Web and to create a probability knowledge space
at runtime to support further AI tasks. The evaluations of the causal probability
knowledge graph have already shown some interesting results and the ability
to enhance the explanation capabilities of prediction and clustering approaches. The
implementation code and the dataset are available at [10]. The current state of the
work has two limitations. The first is that some compound keywords,
e.g. Body pain, have not been captured by the classic NLP and semantic
annotation processes. The second is that our knowledge has not been fully connected to
existing external health knowledge datasets, e.g. UMLS [11]. In the short term, our
research will focus on addressing these limitations. Longer-term future research
has two directions. The first is to develop an efficient embedding method that
captures causal relation features and can be used with well-studied machine learning
algorithms, especially deep learning architectures. The second is to investigate
graph-based learning algorithms that work directly on the graph data and
exploit the reasoning power of the graph, the causal relations, and the
runtime probability definitions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Overhage</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryan</surname>
            ,
            <given-names>P. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reich</surname>
            ,
            <given-names>C. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartzema</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stang</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          :
          <article-title>Validation of a common data model for active safety surveillance research</article-title>
          .
          <source>Journal of the American Medical Informatics Association : JAMIA</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2012</year>
          . https://doi.org/10.1136/amiajnl-2011-000376
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rosenbloom</surname>
          </string-name>
          , ST.,
          <string-name>
            <surname>Carroll</surname>
          </string-name>
          , RJ., Warner, JL.,
          <string-name>
            <surname>Matheny</surname>
          </string-name>
          , ME.,
          <string-name>
            <surname>Denny</surname>
          </string-name>
          , JC.:
          <article-title>Representing Knowledge Consistently Across Health Systems</article-title>
          .
          <source>Yearb Med Inform</source>
          .
          <year>2017</year>
          ;
          <volume>26</volume>
          (
          <issue>1</issue>
          ):
          <fpage>139</fpage>
          -
          <lpage>147</lpage>
          . https://doi.org/10.15265/IY-2017-018
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vucetic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Improving Medical Code Prediction from Clinical Text via Incorporating Online Knowledge Sources</article-title>
          .
          <source>In The World Wide Web Conference (WWW '19)</source>
          , Ling Liu and Ryen White (Eds.). ACM, New York, NY, USA,
          <year>2019</year>
          ,
          <fpage>72</fpage>
          -
          <lpage>82</lpage>
          . https://doi.org/10.1145/3308558.3313485
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>An Introduction to Causal Inference</article-title>
          .
          <source>The International Journal of Biostatistics</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dasgupta</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al:
          <article-title>Causal Reasoning from Meta-reinforcement Learning</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:1901.08162,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Causality Extraction based on Self-Attentive BiLSTM-CRF with Transferred Embeddings</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:1904.07629,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dasgupta</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naskar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks</article-title>
          ,
          <year>2018</year>
          ,
          <fpage>306</fpage>
          -
          <lpage>316</lpage>
          . https://doi.org/10.18653/v1/W18-5035
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>