=Paper= {{Paper |id=Vol-1737/T5-4 |storemode=property |title=Relevance and Support calculation for Health information |pdfUrl=https://ceur-ws.org/Vol-1737/T5-4.pdf |volume=Vol-1737 |authors=S. Suresh Kumar,L Naveen |dblpUrl=https://dblp.org/rec/conf/fire/KumarN16 }} ==Relevance and Support calculation for Health information== https://ceur-ws.org/Vol-1737/T5-4.pdf
Relevance and Support calculation for Health information

S. Suresh Kumar
Department of Information Technology
Assistant Professor
JNTU Hyderabad, Hyderabad
9440936885
sureshsanampudi@gmail.com

L Naveen
Department of Information Technology
Assistant Professor
BVRIT, Hyderabad


ABSTRACT
Consumer Health Information Search (CHIS) is an information retrieval forum that organized two tasks. The first task is to identify whether the sentences in a document are relevant or irrelevant to a given query. The second task is to find the nature of the support that a sentence in the document gives to the query. Task 1, relevance identification, is performed by computing several similarity measures and averaging the obtained scores to decide the relevance of a sentence. The second task is solved with a special type of support vector machine, the C-support vector machine, which can handle multiclass classification. The results obtained from the CHIS organizers show that the model developed for the second task produced more promising results than the model developed for the first task.

Keywords
C-Support Vector Machines, TF-IDF, Jaccard coefficient, cosine similarity.

1. INTRODUCTION
In this work we explore the Consumer Health Information Search (CHIS) task of finding information relevant to a user query in a given collection of sentences. Task 1 of CHIS aims to identify whether a given sentence is relevant to the query or not, whereas Task 2 aims to identify the support nature of the sentence with respect to the query. To retrieve the information relevant to the user query in Task 1, different similarity measures [5][6] have been used. A similarity coefficient was computed between the query and each sentence in the document collection; the average of these coefficients was then used to decide whether a sentence is relevant to the query. Task 2 of CHIS aims to identify whether each sentence in the given document collection supports, opposes, or is neutral towards the claim made in the query. It was treated using a special type of support vector machine that includes the C factor [7].

The rest of the paper is organized as follows. Section 2 discusses the approaches used to compute the relevance between the query and each sentence of the document collection. Section 3 explains the implementation used to obtain the support nature of a sentence with respect to the query. Section 4 elaborates on the dataset and the queries used for search. Section 5 concludes the paper.

2. TASK-1 Relevance Identification
To retrieve the sentences relevant to the query, we calculated similarity measures between the given query and the sentence collection, in both the syntactic and the semantic aspect. A similarity measure reflects the degree of closeness or separation of the target objects, and choosing an appropriate measure is important for an information retrieval task. In general, similarity measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value [5]. Several syntactic similarity measures have been implemented in our model for Task 1, among them Euclidean distance, cosine similarity, and the Jaccard coefficient.

In implementing CHIS Task 1 we used cosine similarity, the Jaccard coefficient, and TF-IDF similarity to capture the syntactic relevance of a sentence, and semantic similarity to capture its semantic relevance. The average of the obtained scores was taken as the overall syntactic and semantic similarity of a given sentence with respect to the query.

2.1 Cosine similarity
In this measure the sentences and the query are represented as term vectors, and similarity is quantified as the cosine of the angle between the query vector and a sentence vector, the so-called cosine similarity. Cosine similarity is the most widespread similarity measure applied to text. Given a sentence collection (S) and a query (Q), the similarity coefficient between them is computed using the following formula:

    SC1(S, Q) = (S . Q) / (|S| x |Q|)

where S and Q are the vector representations of the sentence and the query.

2.2 Jaccard Coefficient
The Jaccard coefficient measures the similarity between finite sample sets. It is defined as the cardinality of the intersection of the sets divided by the cardinality of their union [3]. For text similarity, the Jaccard coefficient compares the sum of the weights of shared terms to the sum of the weights of terms that are present in either of the documents but are not shared. The formal definition is:

    SC2(S, Q) = (S . Q) / (|S|^2 + |Q|^2 - S . Q)

where S and Q are the vector representations of the sentence and the query.

2.3 TF-IDF Similarity
TF-IDF measures are a broad class of functions used for computing similarity and relevance between queries and documents. The basic idea is that the more frequently a word appears in a text, the more indicative that word is of the topicality of the text; and the less frequently a word appears across a document collection, the greater its power to discriminate between relevant and irrelevant. The similarity function is:

    SC(S, Q) = sum over w in Q and S of  log(tf_w,Q + 1) log(tf_w,S + 1) log((N + 1) / (df_w + 0.5))

where tf_w,Q is the number of times word w appears in the query Q; tf_w,S is the number of times word w appears in sentence S; N is the total number of sentences in the collection; and df_w is the number of sentences in which w appears.
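The three syntactic measures above can be sketched in a few lines. The whitespace tokenization used here is an illustrative assumption, not necessarily the preprocessing applied in this work:

```python
import math
from collections import Counter

def term_vectors(sentence, query):
    """Represent sentence and query as term-frequency vectors over
    their joint vocabulary (whitespace tokens; an illustrative choice)."""
    vocab = sorted(set(sentence.split()) | set(query.split()))
    s = [sentence.split().count(w) for w in vocab]
    q = [query.split().count(w) for w in vocab]
    return s, q

def cosine_similarity(sentence, query):
    # SC1(S, Q) = (S . Q) / (|S| x |Q|)
    s, q = term_vectors(sentence, query)
    dot = sum(a * b for a, b in zip(s, q))
    norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def jaccard_coefficient(sentence, query):
    # SC2(S, Q) = (S . Q) / (|S|^2 + |Q|^2 - S . Q)
    s, q = term_vectors(sentence, query)
    dot = sum(a * b for a, b in zip(s, q))
    denom = sum(a * a for a in s) + sum(b * b for b in q) - dot
    return dot / denom if denom else 0.0

def tf_idf_similarity(sentence, query, collection):
    # SC(S, Q) = sum over shared words w of
    #   log(tf_w,Q + 1) * log(tf_w,S + 1) * log((N + 1) / (df_w + 0.5))
    tf_s, tf_q = Counter(sentence.split()), Counter(query.split())
    n = len(collection)
    score = 0.0
    for w in tf_q.keys() & tf_s.keys():
        df = sum(1 for s in collection if w in s.split())
        score += (math.log(tf_q[w] + 1) * math.log(tf_s[w] + 1)
                  * math.log((n + 1) / (df + 0.5)))
    return score
```

Each function returns a single numeric score for one query-sentence pair, matching the formulas above term by term.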
2.4 Semantic Similarity
Semantic similarity measures text similarity derived from the semantic and syntactic information contained in the texts. To compute it, a raw semantic vector is derived for each sentence with the help of a lexical database, and a word order vector is also formed for each sentence using the same information. Each word contributes differently to the meaning of the whole sentence, so the importance of a word is weighted by its information content derived from a corpus; by combining the raw semantic vector with this corpus information, a semantic vector is obtained for each of the two sentences [3]. Semantic similarity is then computed from the two semantic vectors, and an order similarity is computed from the two order vectors. Finally, the overall similarity is derived by combining the order similarity and the semantic similarity.

To find the relevance of a sentence to the given query, the values of the different similarity measures were averaged. A threshold was set: sentences are considered relevant if their score falls above the threshold and irrelevant if it falls below.
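The decision rule just described reduces to an average and a cutoff. The threshold and the scores below are illustrative assumptions, not the tuned values from this work:

```python
def is_relevant(scores, threshold=0.25):
    """Average the per-measure similarity scores (cosine, Jaccard,
    TF-IDF, semantic) and compare the mean against a fixed cutoff.
    The default threshold here is illustrative, not the tuned value."""
    overall = sum(scores) / len(scores)
    return overall >= threshold

# One hypothetical query-sentence pair scored by the four measures:
print(is_relevant([0.6, 0.3, 0.4, 0.5]))   # mean 0.45 -> True
print(is_relevant([0.1, 0.0, 0.05, 0.1]))  # mean 0.0625 -> False
```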
3. Task 2 Support Calculation
With the explosive growth of social media, platforms such as Twitter, Facebook, blogs, and microblogs are used to search for and extract information that helps in decision making. With so much information available from various sources, and of such diverse nature, it becomes very difficult to identify whether a piece of information supports, opposes, or is neutral towards the user query.

The support of a sentence towards the query was identified using a special class of support vector machine that uses a "C factor". A support vector machine is basically a binary classifier, but to obtain the class "neutral" we used a special category of support vector machine, C-Support Vector Classification, which is based on libsvm [7].

In this support vector machine, the first step is to convert the collection of sentences in a document into Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors. The next step is to decide whether the features should be built from words or from character n-grams; the lower and upper boundaries of the n-gram range are set so that n-grams are extracted for all n with min_n <= n <= max_n. After this step, a vocabulary is built from the top max_features terms ordered by term frequency across the corpus. Finally, a learning method is applied to learn the vocabulary and the IDF values and to return the term-document matrix.
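The paper does not name a library, but the knobs it describes (word vs. character features, the min_n/max_n n-gram range, max_features) match scikit-learn's TfidfVectorizer. The sketch below assumes that implementation and uses invented example sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "sun exposure increases the risk of skin cancer",
    "there is no proven link between sun exposure and cancer",
    "moderate sun exposure is considered safe",
]

# analyzer chooses word vs. character features; ngram_range=(min_n, max_n)
# extracts every n-gram with min_n <= n <= max_n; max_features keeps the
# top terms ordered by term frequency across the corpus.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                             max_features=50, lowercase=True)

# fit_transform learns the vocabulary and the IDF values and returns the
# term-document matrix (one row per sentence).
X = vectorizer.fit_transform(sentences)
print(X.shape)
```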
We used the penalty parameter C of the error term together with a Radial Basis Function (RBF) kernel. When training an SVM with an RBF kernel, the parameter C is required: a lower value of C makes the decision boundary smooth, while a higher value of C aims at classifying all the training examples correctly. The C-SVM method provides a grid search that implements fit and score methods and includes further functions such as probability prediction, decision functions, and transformation and inverse transformation.

C-SVM is a supervised learning method: a model is built from the training data and used to predict the class label of unseen data. The support vectors construct a hyperplane, or a set of hyperplanes, in a high-dimensional space, which is used for the identification of support.
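A minimal end-to-end sketch of this setup, assuming scikit-learn's SVC (C-support vector classification, backed by libsvm) with an RBF kernel and a grid search over C; the sentences and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data: sentence -> support / oppose / neutral (invented labels).
train_texts = [
    "vitamin c prevents the common cold",
    "vitamin c has no effect on colds",
    "the study did not consider vitamin c",
] * 4
train_labels = ["support", "oppose", "neutral"] * 4

# C-SVC with an RBF kernel handles the three-class problem; GridSearchCV
# fits and scores the pipeline for each candidate value of C.
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", probability=True))
search = GridSearchCV(pipeline, {"svc__C": [0.1, 1, 10]}, cv=2)
search.fit(train_texts, train_labels)
print(search.predict(["vitamin c stops colds"]))
```

A smaller C smooths the decision boundary; a larger C drives the classifier towards fitting every training example, exactly the trade-off described above.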
4. Implementation Model
For the CHIS task, the organizing committee provided a training dataset of five documents, each with approximately 300 sentences, together with queries for the retrieval process. The training dataset consists of three attributes: attribute 1 is the sentence, attribute 2 states whether it is relevant or irrelevant, and attribute 3 gives the polarity of the sentence towards the query as oppose, support, or neutral.

The steps followed to implement CHIS Task 1 are as follows. The user query, training dataset, and test dataset are taken as inputs, and pre-processing is applied to them: stop words are eliminated and all letters in the query and the sentences are converted to lowercase before performing the actual tasks of CHIS.

The relevance identification task is to find whether a given sentence is relevant or irrelevant to the query. To achieve this, the similarity between the given query and each sentence of the document collection is computed using each of the techniques described above, namely cosine similarity, the Jaccard coefficient, TF-IDF similarity, and semantic similarity. The overall similarity is taken as the average of all these measures. After computing the overall similarity between each query-sentence pair, if it exceeds the threshold value the sentence is considered relevant; otherwise it is considered irrelevant.

After evaluating the similarity between the query and each sentence of the document collection, the training dataset is used to train the C-Support Vector Machine (C-SVM) classifier, which then predicts the class label for the test data. A normal SVM classifier separates only positive and negative classes, but in order to identify the neutral nature of a sentence we used the special "C factor" in the SVM, which identifies marginal values between the upper and lower planes, using the TF-IDF feature as a measure.

5. Conclusion
Consumer Health Information Search provides two tasks. Task 1 is the identification of the relevance of a sentence to the query; Task 2 is the identification of the support (positive/negative/neutral) of the sentence with respect to the query. A framework has been designed to achieve these tasks. For Task 1 we computed several similarity measures to capture the syntactic and semantic similarity between the sentence and the query. Task 2 is the calculation of the support of a sentence towards the given query; to achieve this, a special type of C-support vector classification is used, which takes TF-IDF features and incorporates an n-gram approach to learn the vocabulary. Using these feature vectors, the training dataset is used to learn the model, which is then applied to the test data to find the support. The results obtained from the CHIS organizers show that the method we adopted for finding the relevance factor was not effective compared with the other models submitted for this task. In the results obtained for Task 2, however, our model was found to work better and to compute the support effectively; it stood first compared with the other models developed for this task.

6. REFERENCES
[1] Liu B. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies. 2012 May 22;5(1):1-67.
[2] Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and trends in information retrieval. 2008 Jan 1;2(1-2):1-35.
[3] Li Y, McLean D, Bandar ZA, O'Shea JD, Crockett K. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering. 2006 Aug;18(8):1138-50.
[4] Metzler D, Bernstein Y, Croft WB, Moffat A, Zobel J. Similarity measures for tracking information flow. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management 2005 Oct 31 (pp. 517-524). ACM.
[5] Grossman DA, Frieder O. Information retrieval: Algorithms and heuristics. Springer Science & Business Media; 2012 Nov 12.
[6] Huang A. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand 2008 Apr 14 (pp. 49-56).
[7] Meyer D, Wien FT. Support vector machines. The interface to libsvm in package e1071. 2015 Aug 5.