<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Ontology-Based Approach to Enhance Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sonika Malik</string-name>
          <email>sonika.malik@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarika Jain</string-name>
          <email>jasarika@nitkkr.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Maharaja Surajmal Institute of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Technology</institution>
          ,
          <addr-line>Kurukshetra</addr-line>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>98</lpage>
      <abstract>
        <p>Text classification is the process of assigning a collection of pre-defined classes to free text. It has been one of the most researched areas in machine learning, with applications such as sentiment analysis, topic labelling, language detection, and spam filtering. The efficiency of text classification improves when some relation or pattern in the data is given or known, which can be provided by an ontology. This further helps in reducing the size of the dataset. An ontology is a collection of data items that helps in storing and representing data in a way that preserves the patterns in it and its semantic relationships. We have attempted to verify the improvement provided by the use of ontology in classification algorithms. The code prepared in this research and the method developed are generic, and could be extended to any ontology-based text classification system. In this paper, we present an enhanced architecture that uses ontology to provide an effective text classification mechanism. We introduce an ontology-based text classification algorithm that utilizes the rich semantic information in the Disease Ontology (DOID). We summarize the existing work and finally advocate that the ontology-based text classification strategy is better than conventional text classification in terms of metrics such as Accuracy, Precision, Recall, and F-measure.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Classification</kwd>
        <kwd>Ontology</kwd>
        <kwd>Semantic AI</kwd>
        <kwd>Symbolic AI</kwd>
        <kwd>Statistical AI</kwd>
        <kwd>Classifier</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The classification of entities based on the available data
is the foundation for classification techniques. The
available data could be of two types- the information
that we have on hand, and the information that we have
previously used for classification. Either way, an
accurate and precise classification relies on the amount
of information that is available to us. The ways of
processing and analysing information has been
transformed through digitization.
There is a plethora of textual data everywhere we look
around, from magazines to journals to papers. There is
a need to systematically categorize and interpret this
information without compromising time. Automated
text classification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is one of the most helpful tools for
this.
      </p>
      <p>
        It’s one of the most important and rudimentary tasks
in Natural Language Processing (NLP) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with broad
applications such as sentiment analysis [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], topic
labelling, spam detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and intent detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Text classifiers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are made and meant to be
implemented on a diverse range of textual datasets. Text
classification can work on both, structured and
unstructured datasets. To understand the process of
classification and how ontology fits in this process,
there is a hierarchical progression as shown in Figure
1.
      </p>
      <p>
        Artificial Intelligence (AI) is anticipated to produce
hundreds of billions of dollars in economic value.
However, considering that technology forms part of our
everyday lives, many people remain suspicious. Their
key issue is that AI approaches perform like
black boxes and seem to generate ideas without any
explanation. In addition, many industries recognised
knowledge graphs (KGs) as an effective method for
data processing, management and enrichment [
        <xref ref-type="bibr" rid="ref53">53</xref>
        ].
KGs are also increasingly recognised as a
foundation of AI systems that make AI explainable
via the design concept called “Human-in-the-Loop”
(HITL). The promise of AI is to automatically derive
patterns and rules from massive datasets based on
machine learning algorithms such as deep learning. This
fits very well with particular issues and helps to simplify
classification activities in many situations. The machine
learning algorithms gain the knowledge from historical
information, but they cannot derive new results from it.
Without explanation, there is no confidence.
Explainability ensures that trustworthy agents in the system are
able to understand and justify the AI agent’s decisions
[
        <xref ref-type="bibr" rid="ref50">50</xref>
        ].
      </p>
      <p>
        Semantic AI integrates symbolic AI and statistical AI.
It incorporates approaches like machine learning,
information analysis, the semantic web, and text mining. It
combines the benefits of AI techniques, primarily
neural networks and semantic reasoning. It is an
improvement of the existing framework used primarily
to create AI-based systems. This enables fast learning
from less training data; for example, chatbots can be
developed without the cold-start problem. Semantic AI
incorporates a radically different approach and
therefore complementary skills for additional
stakeholders. While conventional machine learning is
primarily performed by data or information scientists,
Explainable AI and Semantic AI involve further stakeholders. At
the heart of Semantic Enriched Artificial Intelligence
architecture, a semantic knowledge graph is used,
providing the means for more automated data quality
management [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Semantically enhanced data serves as a base
for better-quality data and more options in feature
extraction, and yields better accuracy in the
classification and prediction performed by machine
learning algorithms. Semantic AI aims to have an
infrastructure to address the knowledge asymmetries
between designers of AI applications and other
stakeholders including customers and decision makers,
in direct reference to AI systems which ‘work like
magic’, where only some analysts actually
recognise the fundamental techniques [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Ontology- Ontology specifies a conceptualization of a
domain in terms of concepts, attributes, and relations
[
        <xref ref-type="bibr" rid="ref49">49</xref>
        ]. In simple terms, Ontology is analogous to a
dictionary, which stores the information about entities.
This information usually consists of the features and
relations of the said entities [
        <xref ref-type="bibr" rid="ref51 ref52">51, 52</xref>
        ]. The importance of ontology is
utilised in research fields such as data science, where
its organised structure eases information processing
compared to the more conventional ways of
processing raw data. A formal ontology thus
represents data in an organised way and is used as a framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Ontology-based Text Classification- For Machine
Learning (ML) style classification, algorithms such as
Naive Bayes (NB) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or Support Vector Machine
(SVM) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] etc. are used, where we train a model to
read text as feature vectors and output one of n
classes. One use of ontology would be to mark up
entities in the text. In our case, we have a medical
ontology like DOID, whose nodes have information
about various diseases, symptoms, medications, etc.
We could look for these entities in our text and mark
them as a single entity. For example, if we found the
string “Lung Cancer” in our text, which is also a node
in our ontology, we could replace all occurrences of
“Lung Cancer” with a single token “Lung_Cancer” and
treat this token as a feature for our classification. These
ontology nodes usually contain multiple versions of the
string that represents it. For example, “heart attack” is
also known as “myocardial infarction”, so if our text
contains either string, they could be normalized down
to one single string and treated as a single feature for
classification. For rule-based classifiers such as
Bayesian Networks or decision tree algorithms [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we
could also leverage the knowledge in the ontology to
create generalized rules.
      </p>
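<p>The normalization described above can be sketched as a short Python routine. This is an illustrative sketch rather than the exact code used in this research; the synonym map is a hypothetical fragment of what could be exported from DOID.</p>

```python
# Hypothetical fragment of an ontology-derived synonym map: every
# surface form points to one canonical token used as a feature.
SYNONYMS = {
    "heart attack": "myocardial_infarction",
    "myocardial infarction": "myocardial_infarction",
    "lung cancer": "lung_cancer",
}

def normalize_entities(text, synonyms=SYNONYMS):
    """Replace every known surface form with its canonical token."""
    result = text.lower()
    # Replace longer surface forms first so substrings do not clash.
    for surface in sorted(synonyms, key=len, reverse=True):
        result = result.replace(surface, synonyms[surface])
    return result

print(normalize_entities("Patient had a heart attack; family history of lung cancer."))
# patient had a myocardial_infarction; family history of lung_cancer.
```

<p>Either surface form of a disease is thus reduced to one token, which the classifier then sees as a single feature.</p>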
      <p>The remaining paper has been organised as follows:
Section 2 describes the related work in the field of text
classification. Section 3 defines the background
knowledge. Section 4 presents the assessment of
proposed system. Section 5 describes the comparison
and results and finally paper ends with conclusion and
future scope.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Angelo A. Salatino, Thiviyan Thanapalasingam,
Andrea Mannocci, Francesco Osborne and Enrico
Motta [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] came up with the Computer Science
Ontology (CSO). The CSO consists of up to twenty-six
thousand domains, and as many as two hundred and
twenty-six thousand interpretable relations between
these domains. To support its availability, they also
developed the CSO Portal, a web application which
allows users to explore the ontologies and send
feedback.
      </p>
      <p>
        Angelo A. Salatino, Francesco Osborne and Enrico
Motta [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced the CSO Classifier for automatic
classification of research papers according to the
Computer Science Ontology (CSO). It is an
unsupervised approach. For each piece of research
metadata the classifier takes as input, it returns a list of
suitable topics that could be used in classifying the said
research paper.
      </p>
      <p>
        Angelo A. Salatino, Francesco Osborne and Enrico
Motta [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] presented a CSO classifier for automatic
classification of academic papers according to CSO’s
rich taxonomy of subjects. The aim is to promote the
acceptance of CSO throughout the various communities
involved in scholarly data and enable the creation of
new applications that rely on this knowledge base. The
paper proposed four stages:
(a) constructing a research ontology, (b) classifying
new research proposals into disciplines, (c) building
research proposal clusters using text mining, and (d)
balancing research proposals and regrouping them by
considering applicants’ characteristics.
      </p>
      <p>
        Preet Kaur and Richa Sapra [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] also researched a
similar domain, proposing ontology-based text mining
methods for the classification of research proposals as
well as of external research reviewers.
      </p>
      <p>
        Chaaminda Manjula Wijewickrema and Ruwan
Gamage [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] addressed the fallacies in manual
classification and proposed ontology based methods for
fully automatic text classification.
      </p>
      <p>
        A Sudha Ramkumar, B Poorna and B. Saleena [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
used WordNet ontology to perform ontology based
clustering of sports related terms, so as to preserve the
semantic meaning behind terms while clustering them.
Nayat Sanchez-Pi, Luis Marti and A.C.B. Garcia [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
presented a probing algorithm for the automatic
detection of accidents in occupational health control.
The proposal has more accurate heuristics because it
contrasts the relevance of techniques used with the
terms. The basic accident detection problem is divided
into three parts: (i) text analysis, (ii) recognition and
(iii) classification of failed techniques which caused
accidents.
      </p>
      <p>
        Decker [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] presented a different approach to
categorize research papers by using the words present
in the paper’s abstract. It is an unsupervised method
which evaluates the relevance of suitable topics for the
research paper on various time scales.
      </p>
      <p>
        Herrera et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] devised a way to categorize research
papers specific to the domain of physics. They did this
with the help of PACS, which stands for Physics and
Astronomy Classification Scheme. They created a
network like structure where, a PACS code was
assigned to every topic node, and a connection between
two nodes was possible only if their codes co-occur
together in at least one paper.
      </p>
      <p>
        Ohniwa et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] gave a similar analysis in the field of
biomedicine. They used the Medical Subject Heading
(MeSH).
      </p>
      <p>
        Mai et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] showed that the performance of their
model, which was only trained using titles, was as good
as the models trained by mining the full texts of papers
and articles. They developed their approach using deep
learning techniques. As training set, they used
scientific papers from EconBiz and PubMed,
respectively annotated with the STW Thesaurus for
Economics (approximately five thousand classes) and
MeSH (approximately twenty-seven thousand classes).
Cook et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] developed a method of allocation of
papers to reviewers optimally, to aid the selection
process.
      </p>
      <p>
        Arya and Mittendorf [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] suggested a rotation based
method for the assignment of projects.
      </p>
      <p>
        Choi and Park [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] offered a solution for Research and
Development proposal classification, which was text
mining based.
      </p>
      <p>
        Girotra [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] proposed a study for the evaluation of
portfolio projects.
      </p>
      <p>
        Sun et al. [
        <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
        ] developed a mechanism for
assessment of reviewers, who would evaluate the
research papers. Mehdi Allahyari, Krys J. Kochut and
Maciej Janik [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] proposed a way of dynamic
classification of textual records in dynamically
generated classes.
      </p>
      <p>
        Rudy Prabowo, Mike Jackson, Peter Burden and
Heinz-Dieter Knoell [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] developed a web page
classifier. Its classification was with reference to the
Dewey Decimal System and Library of Congress
Classification schemes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background Knowledge</title>
      <p>In this section, we discuss the pre-processing
steps for textual data and the machine learning classifiers
used in our research.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Pre-Processing Textual data</title>
      <p>
        According to the official documentation, the Natural
Language Toolkit (NLTK) [
        <xref ref-type="bibr" rid="ref48">48</xref>
        ] is a platform used for
building Python programs that work with human
language data, for application in statistical Natural
Language Processing (NLP). It is a useful tool in
python, which helps in processing a diverse range of
languages by providing algorithms for it. This tool is
powerful because it is free and open source. Also, one
does not need to look for any special tutorials when
using the NLTK, as its official documentation is very
well described. The most common algorithms used in
NLTK are tokenization, lemmatization, part of speech
tagging etc. These algorithms are essentially used to
preprocess textual data. The preprocessing takes place
in several parts:
Tokenization- A token is the fundamental building
block of any linguistic structure, such as a sentence or
a paragraph. The process of tokenization is to break
these structures down into tokens. Tokenizers can be
of two types: a sentence tokenizer’s tokens are
sentences, so it breaks paragraphs down into
sentences; a word tokenizer identifies words as tokens,
and hence splits sentences into words.
Stemming- A stem is the root word or phrase from
which different forms of that word could be derived.
Stemming is the process of identifying all the words
that were derived from the same stem and reducing or
normalizing them back to their stem form. For
example, the words connection, connected, and
connecting all reduce to the common stem "connect".
      </p>
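<p>As a minimal, self-contained illustration of tokenization and suffix-stripping stemming (the real pipeline would use NLTK's word tokenizer and a proper stemmer such as the Porter stemmer), consider:</p>

```python
def word_tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return [w.strip(".,;!?") for w in sentence.split() if w.strip(".,;!?")]

def crude_stem(word):
    """Strip a few common suffixes to approximate a stem. This is a
    toy stand-in for a real stemmer like NLTK's PorterStemmer."""
    for suffix in ("ion", "ed", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = word_tokenize("Connection, connected, connecting.")
print([crude_stem(t.lower()) for t in tokens])  # ['connect', 'connect', 'connect']
```

<p>All three derived forms collapse to the single stem "connect", exactly the normalization the stemming step performs.</p>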
      <p>Lemmatization- Sometimes, we may encounter words
that have different stems but the same final meaning. In
such a case, there is a need for a dictionary lookup to
further reduce the stems to their common meaning, or
the base word. This base word is known as lemma, and
hence the name lemmatization. For example, the word
"better" has "good" as its lemma. Such cases are missed
out during stemming because these two words not at all
alike, and would need a dictionary lookup where, their
meanings can confirm the lemma.</p>
      <p>POS Tagging- It stands for Part of Speech, and just as
the name suggests, it identifies the various parts of a
linguistic structure like a sentence. The different parts
could be an adjective or a noun or a verb. It does so by
studying the structure of the sentences and observing
the arrangement of words and the relation between the
various words.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Text classification and classifiers</title>
      <p>
        The idea behind text classification is to group text into
categories using machine learning. It finds use in many
relevant areas such as sentiment analysis, emotion
analysis, etc. There have been many classifiers
developed for each classification category. As stated
previously, text classifiers are made and meant to be
implemented on a diverse range of textual datasets. Text
classification can work on both, structured and
unstructured datasets. Both types of datasets find
numerous applications in various fields. The
Classification process in machine learning can be
explained very simply. First, we assess and analyze the
training dataset for boundary condition purposes. Next,
we predict the class for new data using the information
obtained and learned during the training phase. This is
essentially the whole process of classification.
Classification could be either supervised or
unsupervised. Supervised classification [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] works
on the principle of training and testing, and uses labeled
data, i.e. predefined classes, for prediction. In the
training phase, the model is made to learn some
predefined classes by feeding it labeled or tagged data.
In the testing phase, the efficiency of the model’s
prediction or classification is measured by feeding it
unobserved data. In other words, it can only predict
those classes in the testing phase which, it has learnt in
the training phase. Some common examples of
supervised classification are spam filters, intent
detection, etc. Unsupervised classification [
        <xref ref-type="bibr" rid="ref34 ref35">34, 35</xref>
        ]
involves classification by the model without being fed
the external information. In this, the algorithm of the
model tries to group or cluster data points based on
similar traits, patterns and other common features that
can be used to tie two data points together. A common
example where unsupervised classification is really
helpful is search engines, which create data clusters
based on insights generated from previous searches.
This type of classification is extremely customizable
and dynamic as there is no need for training and tagging
for it to work on textual datasets. Thus, the
unsupervised classification is language compatible.
The classifiers used for text-classification could be ML
based, such as Naïve-Bayes Classifier, Decision Tree
Classifier [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] etc., or they can be based on Neural
Network architecture such as Artificial Neural
Network, Convolutional Neural Network etc. [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ].
The machine learning based classifiers that can be used
for text classification are:
(a) Naive Bayes classifier [
        <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
        ] - It uses the Bayes
theorem to predict values. This algorithm is good for
multi-class classification. Consider a data point x, in a
multi-class scenario with three classes- A, B and C.
Using Naïve Bayes, we try to predict whether the data
point x belongs to class A or B or C, by calculating its
probability for the three classes as given in Eq. 1.
P(c | x) = (P(x | c) · P(c)) / P(x)
(1)
This algorithm is called ‘Naïve’ because it assumes that
all the features are independent of each other, as defined
in Eq. 2.
      </p>
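<p>Eq. 1 can be applied directly. The sketch below scores one feature value x against three classes A, B, and C; the priors and likelihoods are made-up illustrative numbers, not values from our dataset.</p>

```python
# Hedged sketch of Eq. 1: posterior = likelihood * prior / evidence.
priors = {"A": 0.5, "B": 0.3, "C": 0.2}           # P(class)
likelihood_x = {"A": 0.10, "B": 0.40, "C": 0.20}  # P(x | class)

evidence = sum(priors[c] * likelihood_x[c] for c in priors)  # P(x)
posterior = {c: priors[c] * likelihood_x[c] / evidence for c in priors}

best = max(posterior, key=posterior.get)
print(best)  # B: the class with the highest P(class | x)
```

<p>Even though class A has the largest prior, the strong likelihood of x under class B dominates the posterior, so B is predicted.</p>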
      <p>P(x₁, x₂, …, xₙ) = P(x₁) · P(x₂) · ⋯ · P(xₙ)
(2)
There are two further categories of the NB classifier:
the Gaussian NB classifier and the
multinomial NB classifier.</p>
      <p>
        The Gaussian Naïve-Bayes classifier is used when a
dataset has continuous values of data. It uses the
Gaussian Probability Distribution function (values are
centered on mean and as the graph grows, the values
decrease). The Multinomial Naive Bayes algorithm
assumes the independence of features, and the
multinomial component of this classifier ensures that
the distribution is multinomial in its features.
(b) Decision Tree [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] - It is a highly intuitive
algorithm which uses greedy approach. To construct a
decision tree, we have to perform the following steps –
(1) select a feature to split the data, (2) select a method
to split the data on the said feature. It has the internal
working algorithm: (i) create/select a node; (ii) if
the node is pure, output the only class; (iii) if no feature
is left to split upon and the node is impure, output the
majority class; (iv) else find the best feature to split
upon, split on it, and recursively repeat from step (i).
(c) K-Nearest Neighbor [
        <xref ref-type="bibr" rid="ref38 ref40">38, 40</xref>
        ] - Consider a scenario
where we have to predict which class a testing
point belongs to by considering all the features at once.
Such is the working of the KNN algorithm, as shown in
Figure 2. To predict the class of the testing data point,
we check its vicinity. To classify the testing point, we
check a specific number of points (1, 3, 5, 7, etc.) and
whichever class is in majority among those, that one is
predicted. To select the nearest point, we have to
consider its distance from the other points. The distance
metric can be (a) Manhattan distance, (b) Euclidian
Distance, (c) Minkowski distance.
(d) Random forest [
        <xref ref-type="bibr" rid="ref38 ref41">38, 41</xref>
        ] - It is an extension of the
decision tree classifier. This algorithm uses multiple
combinations of decision trees to accurately predict
testing data. The random forest classifier overcomes
the over-fitting problem of decision trees by building
multiple decision trees and going with the majority
result. The trees’ outputs vary because each tree is built
with random data and random features. To generate
randomness in trees, we use two techniques:
(i) Bagging: If we have ‘m’ data points, we select a
subset of ‘k’ of them. For ‘n’ trees, n subsets of k points are
selected. Data points can be selected with
replacement, as the selection is random; therefore, these
trees are called bagged trees.
(ii) Feature Selection: In the training phase, some
features are selected at random in this technique, with
the condition that the selection is performed without
replacement.
(e) SVM Classifier [
        <xref ref-type="bibr" rid="ref38 ref42">38, 42</xref>
        ] - It is a very powerful
algorithm and overcomes a limitation of logistic
regression. Because logistic regression uses the sigmoid
function, the value predicted for a testing data point can be
close to 0.5, which causes the problem of incorrect
prediction. SVM uses the rules of logistic
regression, but scales the values
so that the predicted values do not fall in the range (-1,
1).
      </p>
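<p>The KNN voting step described above can be sketched as follows; the training points and the choice of Euclidean distance are illustrative, not our exact configuration.</p>

```python
from collections import Counter

def knn_predict(train, test_point, k=3):
    """train is a list of ((x, y), label) pairs; predict by majority
    vote among the k nearest neighbours under Euclidean distance."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))  # A
```

<p>With k = 3, the three closest neighbours of (0.5, 0.5) all carry label A, so A wins the vote.</p>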
      <p>This cost function changes to the following equation
in SVM, as given in Eq. 3.</p>
      <p>
J(θ) = C ∑ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + 0.5 ∑ θⱼ²
(3)
(f) Logistic Regression [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]: It is a primitive
classification algorithm which uses the sigmoid
function as in Eq. 4 at its core to perform classification.
g(z) = 1 / (1 + e⁻ᶻ)
(4)
As the sigmoid has an exponential function, the graph
moves exponentially either towards 0 or 1 with a slight
change in x.
      </p>
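<p>Eq. 4 translates directly into code; the sketch below also shows the saturation behaviour described above.</p>

```python
import math

def sigmoid(z):
    """The logistic (sigmoid) function of Eq. 4."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5: the decision boundary
print(sigmoid(6))   # close to 1
print(sigmoid(-6))  # close to 0
```

<p>A small shift in z around zero moves the output quickly toward 0 or 1, which is the exponential behaviour the text refers to.</p>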
      <p>The cost function of the binary logistic regression is
given in Eq. 5.</p>
      <p>
J(h(x)) = ∑ ( −y log(h(x)) − (1 − y) log(1 − h(x)) )
(5)
(g) Bagging Classifier [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ]: A Bagging Classifier is an
ensemble meta-estimator that fits base
classifiers on random subsets of the original
dataset and then combines their individual predictions
to form a final prediction. Usually, such a
meta-estimator can be used to minimize the variance of a
black-box estimator through randomization.
      </p>
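<p>The bagging idea can be sketched as below. The "estimators" here are trivial stand-ins (each simply predicts the majority class of its bootstrap sample) rather than real base classifiers; a practical implementation would fit, for example, a decision tree on each sample.</p>

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a sample of the same size, with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
labels = ["A", "A", "B", "A", "B"]
# Each stand-in "estimator" predicts the majority class of its sample.
estimator_preds = [majority_vote(bootstrap_sample(labels, rng)) for _ in range(7)]
ensemble = majority_vote(estimator_preds)
print(ensemble)
```

<p>Combining many high-variance predictions by majority vote is what lets the meta-estimator reduce the variance of the underlying black-box estimator.</p>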
    </sec>
    <sec id="sec-6">
      <title>4. Proposed Study</title>
      <p>The classification by Machine Learning algorithms is
supposed to improve with the use of ontology. We aim
to verify this fact by studying and comparing values of
metrics such as accuracy, precision, recall and F1 score
for ontology based text classification and conventional
text classification.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1. Conventional Text Classification</title>
      <p>
        In conventional classification, the framework has
three main phases: (i) dataset generation, (ii) model
training and testing, and (iii) analyzing/classifying results,
as shown in Figure 3.
1. Dataset Generation: A preliminary knowledge
database of disease-symptom associations was
available at [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ], which consists of three columns:
disease name, count of disease occurrence, and
symptoms; however, it needed modification to be used
for our research. Some new information was also
added to the dataset so that matching could be done
precisely. The final dataset thus created is the one that
was used for this proposed research. The modified
dataset and the ontology are compatible, as they consist
of the classification/output feature “disease name” and the
matching feature “disease description”. After the
ontology and a working dataset were obtained, cleaning
and preprocessing of the dataset were done using
NLTK. A synthetic dataset was
also generated which involves creating new data using
programming techniques. In this research we created
multiple entries using the random feature value
selection of same class. For example, consider a disease
having 10 symptoms. We randomly select a subset of
these 10 symptoms and generate a new entry for the
dataset involving fewer symptoms and the disease
name. This process helps to bind the symptom values
to the disease and generate strong positive relation
between feature values (symptoms) and class (disease).
2. Model Training/Testing: This phase involves taking the
dataset and applying machine learning classifiers to it.
As the dataset initially contains text keywords, it needs
to be converted into numbers using the count vectorizer
module. After this, the training data is ready for feeding
to the classifier for training. The classifiers used are
KNN, SVM, Logistic Regression, Decision Tree, and
Random Forest. After training, we can use the model
for predictions on testing data. The ratio of training to
testing data is 80:20.
3. Analyzing/Classifying Results: To analyze the
results, we compare the disease predictions for the
testing data with the actual disease class. After
comparing, we calculate classification metrics such as
accuracy, precision, recall, and F1-score. After this
computation we can compare the performance of
multiple classifiers based on these metrics. We can also
verify which classes perform well based on
individual class-wise precision and recall values.
For the purpose of this research, we have used the
Human Disease Ontology, which was hosted at the
website for the Institute of Genome Sciences,
Maryland School of Medicine [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]. This ontology is a
comprehensive hierarchical controlled vocabulary for
human disease representation. It consists of a unique
label for each disease, which acts as an identifier. The
OWL file of the ontology was exported to a CSV file using
Protégé. We have added a second phase between
dataset generation and Model training/Testing, in
which a hybrid approach for text classification is used
to optimize it. The presented phases are: (i)
dataset generation, (ii) ontology matching, (iii)
model training and testing, and (iv) analyzing and
classifying results, as shown in Figure 4(b).
      </p>
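<p>The synthetic-entry generation described above can be sketched as follows; the disease and symptom names are illustrative placeholders, not rows from the actual knowledge base.</p>

```python
import random

def synthesize_rows(disease, symptoms, n_rows, seed=0):
    """For one disease, emit n_rows entries, each pairing the disease
    with a random subset (at least two) of its known symptoms."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        k = rng.randint(2, len(symptoms))
        rows.append({"disease": disease, "symptoms": rng.sample(symptoms, k)})
    return rows

rows = synthesize_rows("influenza", ["fever", "cough", "fatigue", "headache"], 3)
for row in rows:
    print(row["disease"], sorted(row["symptoms"]))
```

<p>Each emitted row ties a smaller symptom subset back to the same disease class, which strengthens the positive relation between feature values and the class.</p>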
      <p>The phases i, iii and iv are explained earlier in section
4.1.</p>
      <p>Ontology Matching: In this phase the keywords
formed from the description of the disease are
matched with the keywords of ontology nodes. All
the matched nodes are possible classes which can be
used to create the subset of the data for efficient
model training. The use of priority-based matching
helps us to further limit the classes. In our research, for
ontology matching each keyword is assigned two
numbers to specify its priority. The first number
describes frequency of the keyword and second
number describes whether the keyword can be
lemmatized or not. If it cannot be lemmatized it is
assigned as 1 otherwise 0. Thus each keyword has
syntax (name, first priority number, second priority
number). The steps for ontology matching are given
in Figure 4(a).
Algorithm 1: Ontology-Based Text Classification
The ontology matching function used in this algorithm refers
to Algorithm 4.2
DOID: Disease Ontology
data_x, data_y = synthetic_data_generation (Knowledge_base)</p>
<p>// Phase 1
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y)
Ontology_tree = Loading_Ontology()</p>
      <p>// Phase 2</p>
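The priority-based matching step described above can be sketched in a few lines of Python. All names here are ours, and the scoring rule (weighting non-lemmatizable, i.e. exact, keywords twice) is an assumed illustration; the paper's exact procedure is given in Figure 4(a) and Algorithm 4.2.

```python
def match_classes(doc_keywords, ontology):
    """Rank candidate disease classes by a priority-weighted match score.

    doc_keywords: list of (name, frequency, non_lemmatizable) tuples,
        following the paper's (name, first priority, second priority)
        convention (second priority is 1 if the word cannot be lemmatized).
    ontology: {class_label: set of node keywords}.
    The 2x weight for non-lemmatizable keywords is an assumed scoring
    rule, not the paper's.
    """
    scores = {}
    for label, node_words in ontology.items():
        score = 0
        for name, freq, non_lemma in doc_keywords:
            if name in node_words:
                score += freq * (2 if non_lemma else 1)
        if score:
            scores[label] = score
    # Highest-scoring classes first; these form the reduced class subset
    # used for model training.
    return sorted(scores, key=scores.get, reverse=True)

nodes = {"malaria": {"fever", "mosquito"}, "influenza": {"fever", "cough"}}
doc = [("fever", 3, 0), ("mosquito", 2, 1)]
print(match_classes(doc, nodes))  # -> ['malaria', 'influenza']
```

Restricting training to the returned subset is what shrinks the effective dataset per document.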
    </sec>
    <sec id="sec-8">
      <title>5. Classification of Various Algorithms with and without Ontology</title>
<p>The parameters used for evaluation and comparison of the
model when used with ontology versus without ontology are
accuracy, precision, recall, and F1 score.</p>
<p>Accuracy describes the overall correctness achieved by a
model after training. It takes into account all the
correctly predicted observations out of all
predictions made.</p>
<p>Accuracy = Number of correct predictions / Total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)</p>
      <p>Precision: sometimes a classifier may label a class
as true for some raw data when, in fact, it
should have been false. This is the case of a
false positive. Precision takes false positives into account:
Precision = TP / (TP + FP)</p>
      <p>Recall: when a classifier marks a class as negative
for a data item when, in fact, it
should have been true, it is a case of a false negative.
Recall accounts for the sensitivity of a model by
taking false negatives into account:
Recall = TP / (TP + FN)</p>
      <p>F1 score is the weighted average of precision and
recall; it therefore factors in false positives as well
as false negatives. In the case of an uneven class
distribution, the F1 score becomes more informative than
accuracy. When false negatives and false
positives have the same cost, accuracy may be treated
as the superior evaluation parameter:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)</p>
      <p>Here TP denotes true positives, TN true negatives,
FP false positives, and FN false negatives.</p>
      <p>With reference to Table 1, the values of the metrics
accuracy, precision, recall, and F1-score have the
same magnitude. This is due to the fact that the FP and
FN counts are equal in magnitude, as there are few
records per disease. This results in an
equal value of precision, recall, and F1-score for each
class, as shown in Table 2.</p>
<p>It can also be observed that the decision tree classifier
shows the highest boost in accuracy, precision, recall,
and F1 score: 0.75 for simple classification versus
0.85 for ontology-based classification. There is a 10%
improvement in the metrics for 500 test cases and 6% for
100 test cases for the decision tree classifier. It is
followed by the KNN classifier, which shows a 5%
improvement for both 100 and 500 test cases. The
remaining classifiers show improvements
of around 1%-3%, as shown in Table 1.</p>
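As a quick sanity check, the improvement figures quoted above can be reproduced from the metric values (the helper is ours, purely for illustration):

```python
def improvement(without_onto, with_onto):
    """Absolute improvement of a metric, in percentage points."""
    return round((with_onto - without_onto) * 100, 1)

# Decision tree: 0.75 without ontology vs. 0.85 with ontology.
print(improvement(0.75, 0.85))  # -> 10.0 percentage points
```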
      <table-wrap>
        <caption>
          <p>Per-classifier metric values recovered from the flattened table. For every classifier, accuracy, precision, recall, and F1 score have the same magnitude, as noted above.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Classifier</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1 Score</th></tr>
          </thead>
          <tbody>
            <tr><td>Naïve-Bayes</td><td>0.978</td><td>0.978</td><td>0.978</td><td>0.978</td></tr>
            <tr><td>Decision Tree</td><td>0.852</td><td>0.852</td><td>0.852</td><td>0.852</td></tr>
            <tr><td>KNN</td><td>0.854</td><td>0.854</td><td>0.854</td><td>0.854</td></tr>
            <tr><td>Random Forest</td><td>0.952</td><td>0.952</td><td>0.952</td><td>0.952</td></tr>
            <tr><td>SVM</td><td>0.992</td><td>0.992</td><td>0.992</td><td>0.992</td></tr>
            <tr><td>Bagging</td><td>0.912</td><td>0.912</td><td>0.912</td><td>0.912</td></tr>
            <tr><td>Logistic Regression</td><td>0.99</td><td>0.99</td><td>0.99</td><td>0.99</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The order of the classifiers with respect to magnitude is as
follows. After the decision tree and KNN classifiers discussed
above, the bagging classifier follows, with each parameter
being 0.87 in simple text classification
and 0.89 in ontology-based classification. The random forest
classifier comes next, showing parameter values
of 0.96 and 0.97 in the two classification cases.
The Naïve-Bayes classifier and the SVM classifier have the
same values for all parameters: 0.98 for simple
classification and 1.0 for ontology-based text
classification. Finally, logistic regression
can be labelled the best classifier, with metric
values of 0.99 for simple text classification and 1.0 for
ontology-based text classification. There is only a minute
difference in the metric values of these classifiers. The
order of increasing accuracy of the various classifiers
(with or without ontology) is:
Decision Tree Classifier &lt; KNN Classifier &lt;
Bagging Classifier &lt; Random Forest Classifier &lt;
Naïve Bayes Classifier &lt; SVM Classifier &lt; Logistic
Regression.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Conclusion and Future Scope</title>
<p>In this paper, the observations show that ontology-based
classification stands at a higher level than
classification without ontology. The general pattern
points towards a more accurate and precise
classification when using an ontology. All the parameters
used (accuracy, precision, recall, and F1 score)
showed an improvement, of around 1% to 3% for most
classifiers and up to 10% for the decision tree, when the
classification was done with the help of the ontology. It can
be deduced that using the ontology increased the efficiency of
classification. This advantage can be attributed to the fact
that the number of possible classes for classification is reduced,
which, in turn, reduces the time taken for training.
The results also indicate a more comparable
accuracy level amongst the classifiers when the ontology
was used. This study, while demonstrating the importance and
benefits of ontology, still has a lot of scope for future
improvement. Future work is needed to improve the
dataset of diseases used in this project, as no
official dataset was available for the Human Disease Ontology.
Most of the work has been done on a limited dataset
obtained by converting the available ontology into a
dataset. There is also a need to optimize the code used for
ontology matching after the data preprocessing has been
done. One could also move on from machine learning
towards deep learning and build a neural network for this
dataset to further improve the classification results
in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. V.
          <article-title>Korde “Text classification and classifiers: A survey</article-title>
          ,”
          <source>International Journal of Artificial Intelligence &amp; Applications (IJAIA) 3</source>
          (
          <issue>2</issue>
          ) (
          <year>2012</year>
          ):
          <fpage>85</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <source>Natural Language Processing, IEEE Fifth International Conference on Hybrid Intelligent Systems (HIS'05)</source>
          , Rio de Janeiro, Brazil
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vovsha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Rambow</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Passonneau</surname>
          </string-name>
          , “Sentiment Analysis of Twitter Data”
          (
          <year>2011</year>
          ).
          <source>In Proc. WLSM-11.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Bhowmick and
          <string-name>
            <given-names>S.M.</given-names>
            <surname>Hazarika</surname>
          </string-name>
          ,
          <article-title>E-mail spam filtering: a review of techniques and trends</article-title>
          . In:
          <string-name>
            <surname>Kalam</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <article-title>Sharma</article-title>
          K (eds) Advances in electronics,
          <source>communication and computing. Lecture notes in electrical engineering</source>
          , 443. Springer, Singapore,
          <fpage>583</fpage>
          -
          <lpage>590</lpage>
          .
          <year>2018</year>
          . https://doi.org/10.1007/978-981-10-4765-7_61
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Akulick</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.S.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          ,
          <article-title>Intent Detection through Text Mining and Analysis</article-title>
          .
          <source>In Proceedings of the Future Technologies Conference (FTC)</source>
          , Vancouver, Canada,
          <fpage>29</fpage>
          -
          <lpage>30</lpage>
          November
          <year>2017</year>
          ;
          <fpage>493</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>K.Das</surname>
            and
            <given-names>R.N.</given-names>
          </string-name>
          <string-name>
            <surname>Behera</surname>
          </string-name>
          , “A Survey on Machine Learning: Concept, Algorithms and Applications,”
          <source>International Journal of Innovative Research in Computer and Communication Engineering</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Blumauer, PoolParty Semantic Suite,
          <year>2018</year>
          , URL: https://www.poolparty.biz/semantic-ai/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. https://www.slideshare.net/semwebcompany/semanti c-ai</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name><given-names>James</given-names> <surname>Hendler</surname></string-name>
          and
          <string-name>
            <given-names>Ora</given-names>
            <surname>Lassila</surname>
          </string-name>
          , Scientific American: Feature Article: The Semantic Web: May 2001
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>K.A. Vidhya</surname>
          </string-name>
          , G. Aghila, “A Survey of Naïve Bayes
          <article-title>Machine Learning approach in Text Document Classification”</article-title>
          , (IJCSIS) International Journal of
          <source>Computer Science and Information Security</source>
          ,
          <volume>7</volume>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>T. JOACHIMS</surname>
          </string-name>
          ,
          <article-title>Text categorization with support vector machines: learning with many relevant features</article-title>
          .
          <source>In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, Germany</source>
          ,
          <year>1998</year>
          ),
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>D. E. Johnson</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          <string-name>
            <surname>Oles</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , T. Goetz,
          <article-title>“A decision-tree-based symbolic rule induction system for text Categorization”, by IBM systems journal</article-title>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          )
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Thanapalasingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          and E. Motta, “Classifying Research Papers with the Computer Science Ontology,” Knowledge Media Institute, The Open University, MK7 6AA,
          <string-name>
            <surname>Milton</surname>
            <given-names>Keynes</given-names>
          </string-name>
          , UK,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          , The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas, 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12,
          <year>2018</year>
          , Proceedings, Part II
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. Motta.</surname>
          </string-name>
          <article-title>The CSO classifier: Ontology-driven detection of research topics in scholarly articles</article-title>
          . In: A.
          <string-name>
            <surname>Doucet</surname>
          </string-name>
          et al. (eds.)
          <source>TPDL</source>
          <year>2019</year>
          :
          <article-title>23rd International Conference on Theory and Practice of Digital Libraries</article-title>
          . Cham, Switzerland: Springer,
          <year>2019</year>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>311</lpage>
          . doi: 10.1007/978-3-030-30760-8_26.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. J. Ma, W. Xu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Turban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          , O. Liu, “An Ontology-Based Text-Mining Method to Cluster Proposals for Research Project Selection,” IEEE
          <source>Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans</source>
          ,
          <volume>42</volume>
          (
          <issue>3</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaur</surname>
          </string-name>
          , R. Sapra, “
          <article-title>Ontology Based Classification and Clustering of Research Proposals and External Research Reviewers</article-title>
          ”, International Journal of Computers &amp; Technology,
          <volume>5</volume>
          (
          <issue>1</issue>
          )
          <year>2013</year>
          , ISSN 2277-
          <fpage>3061</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>C.M. Wijewickrema</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Gamage</surname>
          </string-name>
          ,
          <article-title>An Ontology Based Fully Automatic Document Classification System Using an Existing Semi-Automatic System</article-title>
          ,
          <source>National Institute of Library and Information Sciences</source>
          , University of Colombo, Colombo, Sri Lanka,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Sudha</given-names>
            <surname>Ramkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poorna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saleena</surname>
          </string-name>
          ,
          <article-title>Ontology based text document clustering for sports</article-title>
          ,
          <source>Journal of Engineering and Applied Sciences</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Nayat</surname>
            Sanchez-Pi,
            <given-names>Luis</given-names>
          </string-name>
          <string-name>
            <surname>Marti</surname>
          </string-name>
          and
          <string-name>
            <surname>A.C.B. Garcia</surname>
          </string-name>
          , “Improving ontology-based text classification: An occupational health and security application,”
          <source>Journal of Applied Logic, September 2015</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>S.L.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Aleman-meza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cameron</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.B.</given-names>
            <surname>Arpinar</surname>
          </string-name>
          ,
          <article-title>Detection of Bursty and Emerging Trends towards Identification of Researchers, the Early Stage of Trends (</article-title>
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>M. Herrera</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          <string-name>
            <surname>Roberts</surname>
            and
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Gulbahce</surname>
          </string-name>
          ,
          <article-title>Mapping the evolution of scientific fields</article-title>
          .
          <source>PLoS ONE</source>
          .
          <volume>5</volume>
          (
          <issue>5</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>R.L. Ohniwa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hibino</surname>
            and
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Takeyasu</surname>
          </string-name>
          ,
          <article-title>Trends in research foci in life science fields over the last 30 years monitored by emerging topics</article-title>
          ,
          <source>Scientometrics</source>
          .
          <volume>85</volume>
          (
          <issue>1</issue>
          ),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>F.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Galke</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          ,
          <article-title>Using Deep Learning for Title Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text</article-title>
          ,
          <source>JCDL '18 Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (Fort Worth</source>
          , Texas, USA, Jun.
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>W. D. Cook</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Golany</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Kress</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Penn</surname>
          </string-name>
          , and T. Raviv, “Optimal allocation of proposals to reviewers to facilitate effective ranking,” Manage. Sci.,
          <volume>51</volume>
          (
          <issue>4</issue>
          ),
          <fpage>655</fpage>
          -
          <lpage>661</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26. Arya and B. Mittendorf, “Project assignment when budget padding taints resource allocation,” Manage. Sci., vol.
          <volume>52</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>1345</fpage>
          -
          <lpage>1358</lpage>
          , Sep.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Choi</surname>
            and
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Park</surname>
          </string-name>
          , “R&amp;
          <article-title>D proposal screening system based on text mining approach,” Int</article-title>
          .
          <string-name>
            <given-names>J.</given-names>
            <surname>Technol</surname>
          </string-name>
          . Intell. Plan,
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <fpage>61</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. K. Girotra, C. Terwiesch, and K. T. Ulrich, “Valuing R&amp;
          <article-title>D projects in a portfolio: Evidence from the pharmaceutical industry</article-title>
          ,” Manage. Sci.,
          <volume>53</volume>
          (
          <issue>9</issue>
          )
          <fpage>1452</fpage>
          -
          <lpage>1466</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. Y. H. Sun, J. Ma, Z. P. Fan, and J. Wang, “A group
          <article-title>decision support approach to evaluate experts for R&amp;D project selection</article-title>
          ,” IEEE Transactions of Engineering management,
          <volume>55</volume>
          (
          <issue>1</issue>
          ),
          <fpage>158</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. P.</given-names>
            <surname>Fan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>“A hybrid knowledge and model approach for reviewer assignment</article-title>
          ,”
          <source>Expert Systems with Applications</source>
          ,
          <volume>34</volume>
          (
          <issue>2</issue>
          ),
          <fpage>817</fpage>
          -
          <lpage>824</lpage>
          , Feb.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>M. Allahyari</surname>
            ,
            <given-names>K. J.</given-names>
          </string-name>
          <string-name>
            <surname>Kochut</surname>
            and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Janik</surname>
          </string-name>
          ,
          <article-title>Ontology-based text classification into dynamically defined topics</article-title>
          ,
          <source>Semantic Computing (ICSC)</source>
          ,
          <fpage>273</fpage>
          -
          <lpage>278</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>R.</given-names>
            <surname>Prabowo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Burden</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Knoell</surname>
          </string-name>
          ,
          <article-title>Ontology-Based Automatic Classification for the Web Pages Design Implementation and Evaluation"</article-title>
          ,
          <source>Proc. Of the 3rd International Conference on Web Information Systems Engineering</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33. F.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Osisanwo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.E.T.</given-names>
            <surname>Akinsola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Awodele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.O.</given-names>
            <surname>Hinmikaiye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Olakanmi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Akinjobi</surname>
          </string-name>
          ,
          <article-title>Supervised Machine Learning Algorithms: Classification and Comparison</article-title>
          ,
          <source>International Journal of Computer Trends and Technology (IJCTT)</source>
          ,
          <volume>48</volume>
          (
          <issue>3</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <given-names>M.</given-names>
            <surname>Khanum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mahboob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Imtiaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.A.</given-names>
            <surname>Ghafoor</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sehar</surname>
          </string-name>
          ,
          <article-title>A Survey on Unsupervised Machine Learning Algorithms for Automation, Classification and Maintenance</article-title>
          ,
          <source>International Journal of Computer Applications</source>
          (ISSN 0975-8887)
          <volume>119</volume>
          (
          <issue>13</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wentian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhong</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>The Analysis of the Ontology-based K-Means Clustering Algorithm</article-title>
          ,
          <source>Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE)</source>
          ,
          <year>2013</year>
          , [online] Available: https://www.atlantis-press.com/proceedings/iccsee13/4617.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <given-names>P.</given-names>
            <surname>Vateekul</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kubat</surname>
          </string-name>
          ,
          <article-title>Fast Induction of Multiple Decision Trees in Text Categorization From Large Scale, Imbalanced, and Multi-label Data</article-title>
          ,
          <source>IEEE International Conference on Data Mining Workshops</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <given-names>S.</given-names>
            <surname>Dargan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ayyagari</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>A Survey of Deep Learning and Its Applications: A New Paradigm to Machine Learning</article-title>
          , Springer,
          June
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>38. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Bayesian Multinomial Naïve Bayes Classifier to Text Classification</article-title>
          . In:
          <source>Advanced Multimedia and Ubiquitous Engineering</source>
          . Springer,
          <fpage>347</fpage>
          -
          <lpage>352</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <article-title>KNN Model-Based Approach in Classification</article-title>
          ,
          <source>Proc. ODBASE</source>
          , pp.
          <fpage>986</fpage>
          -
          <lpage>996</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <given-names>G.</given-names>
            <surname>Biau</surname>
          </string-name>
          ,
          <article-title>Analysis of a Random Forests Model</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>13</volume>
          ,
          <fpage>1063</fpage>
          -
          <lpage>1095</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Study on Multi-label Text Classification Based on SVM</article-title>
          ,
          <source>Sixth International Conference on Fuzzy Systems and Knowledge Discovery</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>43. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html</mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>44. https://bioportal.bioontology.org/ontologies/DOID</mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>45. http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html</mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          46.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Freund</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.E.</given-names>
            <surname>Schapire</surname>
          </string-name>
          ,
          <article-title>A Short Introduction to Boosting</article-title>
          ,
          <source>Journal of Japanese Society for Artificial Intelligence</source>
          ,
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>771</fpage>
          -
          <lpage>780</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          47.
          <string-name>
            <given-names>V.N.</given-names>
            <surname>Garla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brandt</surname>
          </string-name>
          ,
          <article-title>Ontology-Guided Feature Engineering for Clinical Text Classification</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>45</volume>
          (
          <issue>5</issue>
          ):
          <fpage>992</fpage>
          -
          <lpage>998</lpage>
          ,
          <year>2012</year>
          . doi: 10.1016/j.jbi.2012.04.010
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>48. https://github.com/nltk/nltk</mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          49.
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          ,
          <article-title>Ontologies: Silver Bullet for Knowledge Management and Electronic Commerce</article-title>
          . Springer-Verlag,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          50. https://www.forbes.com/sites/forbestechcouncil/2019/12/30/explainable-ai-the-rising-role-of-knowledge-scientists/#62bc6193603f
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          51.
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Devising a super ontology</article-title>
          ,
          <source>Procedia Computer Science</source>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>792</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          52.
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Ontology based context aware model</article-title>
          .
          <source>In Proceedings of the international conference on computational intelligence in data science (ICCIDS)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          53.
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Understanding Semantics-based Decision Support</article-title>
          , Nov
          <year>2020</year>
          , 152 pages, CRC Press, Taylor &amp; Francis Group.
          <source>ISBN: 9780367443139 (HB)</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>