Relevance and Support Calculation for Health Information

S. Suresh Kumar
Assistant Professor, Department of Information Technology
JNTU Hyderabad, Hyderabad
9440936885
sureshsanampudi@gmail.com

L Naveen
Assistant Professor, Department of Information Technology
BVRIT Hyderabad, Hyderabad

ABSTRACT
Consumer Health Information Search (CHIS) is an information retrieval forum that has organized two tasks. The first task is to identify whether a given sentence in a document is relevant or irrelevant to the query. The second task is to determine the nature of the support that a sentence in the document gives to the query.

Task 1, relevance identification, is performed by computing several similarity measures and averaging the obtained scores to decide the relevance of a sentence. The second task is solved with a special type of support vector machine, the C-support vector machine, which can handle multiclass classification. The results reported by the CHIS organizers show that the model developed for the second task produced more promising results than the model developed for the first task.
Keywords
C-Support Vector Machines, TF-IDF, Jaccard coefficient, cosine similarity.

1. INTRODUCTION
In this work we explore the Consumer Health Information Search (CHIS) task of finding the information relevant to a user query in a given collection of sentences. Task 1 of CHIS aims to identify whether a given sentence is relevant to the query or not, whereas Task 2 aims to identify the support nature of the sentence with respect to the query.

To retrieve the relevant information for the user query in Task 1, different similarity measures [5][6] have been used. A similarity coefficient was computed between the query and each sentence in the document collection; the average of these coefficients was then used to decide whether the sentence is relevant to the query or not. Task 2 of CHIS aims to identify whether each sentence in the given document collection supports, opposes, or is neutral to the claim made in the query. It was treated using a special type of support vector machine that includes the C factor [7].

The rest of the paper is organized as follows. Section 2 discusses the approaches used to compute the relevance between the query and each sentence of the document collection. Section 3 explains the implementation for identifying the support nature of a sentence with respect to the query. Section 4 elaborates the implementation model, including the dataset and the queries used for search. Section 5 concludes the paper.
2. TASK-1 Relevance Identification
To retrieve the collection of sentences relevant to the query, we calculated similarity measures between the given query and the sentence collection, in both the syntactic and the semantic aspect. A similarity measure reflects the degree of closeness or separation of the target objects, and choosing an appropriate similarity measure is important for an information retrieval task. In general, similarity measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value [5]. Several syntactic similarity measures exist, among them Euclidean distance, cosine similarity, and the Jaccard coefficient.

In implementing CHIS Task 1 we used cosine similarity, the Jaccard coefficient, and TF-IDF similarity to capture the syntactic relevance of a sentence, and a semantic similarity measure to capture its semantic relevance. The average of the obtained scores was taken as the overall syntactic and semantic similarity of a given sentence with respect to the query.

2.1 Cosine Similarity
In this measure the sentences and the query are represented as term vectors, and the similarity is quantified as the cosine of the angle between the query vector and a sentence vector, the so-called cosine similarity. Cosine similarity is the most widespread similarity measure applied to check the similarity between texts. Given a sentence S from the collection and a query Q, the similarity coefficient between them is computed using the following formula:

SC1(S, Q) = (S · Q) / (|S| × |Q|)

where S and Q are the vector representations of the sentence and the query.

2.2 Jaccard Coefficient
The Jaccard coefficient measures the similarity between finite sample sets. It is defined as the cardinality of the intersection of the sets divided by the cardinality of the union of the sample sets [3]. For text similarity, the Jaccard coefficient compares the sum of the weights of shared terms to the sum of the weights of terms that are present in either document but are not shared. The formal definition is:

SC2(S, Q) = (S · Q) / (|S|^2 + |Q|^2 - S · Q)

where S and Q are the vector representations of the sentence and the query.

2.3 TF-IDF Similarity
TF-IDF measures are a broad class of functions used to compute the similarity and relevance between queries and documents. The basic idea is that the more frequently a word appears in a text, the more indicative that word is of the topicality of the text; and the less frequently a word appears across a document collection, the greater its power to discriminate between relevant and irrelevant text. The similarity function is:

SC(S, Q) = Σ_{w ∈ Q ∩ S} log(tf_{w,Q} + 1) · log(tf_{w,S} + 1) · log((N + 1) / (df_w + 0.5))

where tf_{w,Q} is the number of times word w appears in the query Q; tf_{w,S} is the number of times word w appears in the sentence S; N is the total number of sentences in the collection; and df_w is the number of sentences in which w appears.
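As a concrete illustration, the three syntactic measures above can be sketched in plain Python over bag-of-words term vectors. This is a toy sketch under the formulas given here, not the paper's implementation, which is not published:

```python
from collections import Counter
from math import log, sqrt

def term_vector(text):
    # Bag-of-words term-frequency vector for a lowercased text.
    return Counter(text.lower().split())

def dot(a, b):
    # Dot product over the shared vocabulary of two term vectors.
    return sum(a[w] * b[w] for w in a.keys() & b.keys())

def cosine(sentence, query):
    # SC1(S, Q) = (S . Q) / (|S| x |Q|)
    a, b = term_vector(sentence), term_vector(query)
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

def jaccard(sentence, query):
    # SC2(S, Q) = (S . Q) / (|S|^2 + |Q|^2 - S . Q)
    a, b = term_vector(sentence), term_vector(query)
    shared = dot(a, b)
    return shared / (dot(a, a) + dot(b, b) - shared)

def tfidf_score(sentence, query, collection):
    # SC(S, Q) = sum over shared words w of
    #   log(tf_wQ + 1) * log(tf_wS + 1) * log((N + 1) / (df_w + 0.5))
    a, b = term_vector(sentence), term_vector(query)
    n = len(collection)
    score = 0.0
    for w in a.keys() & b.keys():
        df = sum(1 for d in collection if w in term_vector(d))
        score += log(b[w] + 1) * log(a[w] + 1) * log((n + 1) / (df + 0.5))
    return score
```

For example, `cosine("sugar causes diabetes", "does sugar cause diabetes")` scores the overlap of the two word sets, while `tfidf_score` additionally down-weights words that occur in many sentences of the collection.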
2.4 Semantic Similarity
Semantic similarity measures text similarity derived from the semantic and syntactic information contained in the given texts. To compute it, a raw semantic vector is derived for each sentence with the help of a lexical database, and a word order vector is also formed for each sentence using the same lexical information. Each word in a sentence contributes differently to the complete meaning of the whole sentence, so the importance of a word is weighted using information content derived from a corpus; by combining the raw semantic vector with this corpus information, a semantic vector is obtained for each of the two texts [3]. Semantic similarity is then computed from the two semantic vectors, and an order similarity is computed from the two order vectors. Finally, the overall similarity is derived by combining the order similarity and the semantic similarity.

To find the relevance of a sentence to the given query, the values of the different similarity measures were averaged. A threshold was set so that sentences whose average falls above the threshold are declared relevant, and those falling below it are declared irrelevant.
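The averaging-and-threshold decision can be sketched as follows. The threshold value 0.3 is illustrative only; the paper does not state the value it used:

```python
def is_relevant(scores, threshold=0.3):
    # Average the individual similarity scores (e.g. cosine, Jaccard,
    # TF-IDF, semantic) and declare the sentence relevant when the
    # average clears the threshold. The 0.3 cut-off is an assumption
    # for illustration; the paper does not publish its value.
    overall = sum(scores) / len(scores)
    return overall >= threshold
```

A sentence scoring, say, [0.62, 0.41, 0.35, 0.50] across the four measures averages 0.47 and would be kept as relevant under this illustrative cut-off.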
3. Task 2 Support Calculation
With the explosive growth of social media, platforms such as Twitter, Facebook, blogs and microblogs are used to search for and extract information that helps in decision making.
Because a lot of information is available from various sources and is diverse in nature, it becomes very difficult to identify whether a piece of information supports, opposes, or is neutral to the user query.

The support of a sentence towards the query was identified using a special class of support vector machine that uses a "C factor". A support vector machine is basically a binary classifier, but to obtain the class "neutral" we used a special category of support vector machine, C-Support Vector Classification, which is based on libsvm [7].

In this approach the first step is to convert the collection of sentences in a document into Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors. The next step is to decide whether a feature should be made of word n-grams or character n-grams; the lower and upper boundaries of the range of n-values are set so that n-grams are extracted for all values of n with min_n <= n <= max_n. After this step, a vocabulary is built that considers the top max_features terms ordered by term frequency across the corpus. A learning method is then applied to learn the vocabulary and the IDF weights and to return the term-document matrix.
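The vectorization step described above can be sketched in plain Python. The parameter names `min_n`, `max_n` and `max_features` follow the text; the actual implementation (presumably a standard TF-IDF vectorizer feeding the libsvm-based classifier) is not published, so this is a minimal sketch of the described behaviour for word n-grams:

```python
from collections import Counter
from math import log

def ngrams(tokens, min_n, max_n):
    # Word n-grams for every n in the range min_n <= n <= max_n.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def fit_tfidf(sentences, min_n=1, max_n=2, max_features=1000):
    # Build the vocabulary as the top max_features n-grams ordered by
    # term frequency across the corpus, learn IDF weights, and return
    # the term-document (TF-IDF) matrix, mirroring the steps above.
    tf_per_doc = [Counter(ngrams(s.lower().split(), min_n, max_n))
                  for s in sentences]
    corpus_tf = Counter()
    for tf in tf_per_doc:
        corpus_tf.update(tf)
    vocab = [term for term, _ in corpus_tf.most_common(max_features)]
    n_docs = len(sentences)
    idf = {t: log(n_docs / sum(1 for tf in tf_per_doc if t in tf)) + 1
           for t in vocab}
    matrix = [[tf[t] * idf[t] for t in vocab] for tf in tf_per_doc]
    return vocab, matrix
```

The resulting term-document matrix is the feature representation on which the C-Support Vector Classification stage is then trained.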
We used the parameter C, the penalty parameter of the error term, together with a Radial Basis Function (RBF) kernel. When training an SVM with an RBF kernel, the parameter C is required: a lower value of C makes the decision boundary smooth, while a higher value of C aims at classifying all the training examples correctly. The C-SVM method provides a grid search that implements fit and score methods and supports functions such as probability prediction, decision functions, and transformation with its inverse.

C-SVM is a supervised learning technique in which a model is built to learn from the training data and to predict the class label of unseen data. The support vectors construct a hyperplane, or a set of hyperplanes, in a high-dimensional space, which is used for the identification of support.

4. Implementation Model
For the CHIS task, the organizing committee provided a training dataset of five documents, each with approximately 300 sentences, together with queries for the retrieval process. The training dataset consists of three attributes: attribute 1 is the sentence, attribute 2 states whether the sentence is relevant or irrelevant, and attribute 3 gives the polarity of the sentence towards the query as oppose, support, or neutral.

The steps followed to implement CHIS Task 1 are as follows. The user query, the training dataset, and the test dataset are taken as inputs, and pre-processing is applied to them. The pre-processing steps include stop word elimination, and all letters in the query and the sentences are converted to lowercase before the actual CHIS tasks are performed.

The relevance identification task is to find whether a given sentence is relevant or irrelevant to the query. To achieve this, the similarity between the given query and each sentence of the document collection is computed with each of the techniques shown in the diagram, namely cosine similarity, the Jaccard coefficient, TF-IDF similarity, and semantic similarity. The overall similarity is taken as the average of all these similarity measures. After the overall similarity between each pair of query and sentence has been computed, a sentence whose similarity exceeds the threshold value is considered relevant; otherwise it is considered irrelevant.

After evaluating the similarity between the query and each sentence in the document collection, the training dataset is used to train the C-Support Vector Machine (SVM) classifier, which predicts the class label for the test data. A normal SVM classifier separates only positive and negative classes, but in order to identify the neutral nature of a given sentence we used the special "C factor" in the SVM, which identifies marginal values between the upper and lower planes, using the TF-IDF features as the measure.

5. Conclusion
Consumer Health Information Search comprises two tasks. Task 1 is the identification of the relevance of a sentence to the query. Task 2 is the identification of the support (positive/negative/neutral) of the sentence with respect to the query. A framework has been designed to achieve these tasks. For Task 1 we computed several similarity measures to find the syntactic and semantic similarity between the sentence and the query. Task 2 is the calculation of the support of a sentence towards the given query; to achieve it, a special type of C-support vector classification is used, which works on TF-IDF features and incorporates the n-gram approach to learn the vocabulary. Using these feature vectors, the training dataset is used to learn the model, which is then applied to the test data to find the support. The results obtained from the CHIS organizers showed that the method we adopted for finding the relevance factor was not effective when compared with the other models submitted for that task. In the results obtained for Task 2, our model was found to work better and to compute the support effectively; it stood first when compared with the other models developed for this task.

6. REFERENCES
[1] Liu B. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. 2012 May 22;5(1):1-67.
[2] Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval. 2008 Jan 1;2(1-2):1-35.
[3] Li Y, McLean D, Bandar ZA, O'Shea JD, Crockett K. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering. 2006 Aug;18(8):1138-50.
[4] Metzler D, Bernstein Y, Croft WB, Moffat A, Zobel J. Similarity measures for tracking information flow. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management 2005 Oct 31 (pp. 517-524). ACM.
[5] Grossman DA, Frieder O. Information Retrieval: Algorithms and Heuristics. Springer Science & Business Media; 2012 Nov 12.
[6] Huang A. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand 2008 Apr 14 (pp. 49-56).
[7] Meyer D, Wien FT. Support vector machines: the interface to libsvm in package e1071. 2015 Aug 5.