Relevance and Support Calculation for Health Information

S. Suresh Kumar
Assistant Professor, Department of Information Technology
JNTU Hyderabad, Hyderabad
9440936885
sureshsanampudi@gmail.com

L Naveen
Assistant Professor, Department of Information Technology
BVRIT Hyderabad, Hyderabad

ABSTRACT
Consumer Health Information Search (CHIS) is an information retrieval forum that has organized two tasks. The first task is to identify whether a given sentence in a document is relevant or irrelevant to the query. The second task is to determine the nature of the support that a sentence in the document gives to the query.

Task 1, relevance identification, is performed by computing several similarity measures and averaging the obtained scores to decide the relevance of a sentence. The second task is solved with a special type of support vector machine, the C-support vector machine, which can handle multiclass classification. The results reported by the CHIS organizers show that the model developed for the second task produced more promising results than the model developed for the first task.
Keywords
C-Support Vector Machines, TF-IDF, Jaccard coefficient, cosine similarity.

1. INTRODUCTION
In this work we explore the Consumer Health Information Search (CHIS) task of finding the information relevant to a user query in a given collection of sentences. Task 1 of CHIS aims to identify whether a given sentence is relevant to the query or not, whereas Task 2 aims to identify the support nature of the sentence with respect to the query.

To retrieve the relevant information for the user query in Task 1, different similarity measures [5][6] have been used. A similarity coefficient was computed between the query and each sentence in the document collection; the average of these coefficients was then used to decide whether the sentence is relevant to the query or not. Task 2 of CHIS aims to identify whether each sentence in the given document collection supports, opposes, or is neutral to the claim made in the query. It was treated using a special type of support vector machine that includes the C factor [7].

The rest of the paper is organized as follows. Section 2 discusses the approaches used to compute the relevance between the query and each sentence of the document collection. Section 3 explains the implementation for identifying the support nature of a sentence with respect to the query. Section 4 elaborates the implementation model, including the dataset and the queries used for search. Section 5 concludes the paper.
2. TASK-1 Relevance Identification
To retrieve the collection of sentences relevant to the query, we calculated similarity measures between the given query and the sentence collection, in both the syntactic and the semantic aspect. A similarity measure reflects the degree of closeness or separation of the target objects, and choosing an appropriate similarity measure is important for an information retrieval task. In general, similarity measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value [5]. Several syntactic similarity measures exist, among them Euclidean distance, cosine similarity, and the Jaccard coefficient.

In implementing CHIS Task 1 we used cosine similarity, the Jaccard coefficient, and TF-IDF similarity to capture the syntactic relevance of a sentence, and a semantic similarity measure to capture its semantic relevance. The average of the obtained scores was taken as the overall syntactic and semantic similarity of a given sentence with respect to the query.

2.1 Cosine Similarity
In this measure the sentences and the query are represented as term vectors, and the similarity is quantified as the cosine of the angle between the query vector and a sentence vector, the so-called cosine similarity. Cosine similarity is the most widespread similarity measure applied to check the similarity between texts. Given a sentence S from the collection and a query Q, the similarity coefficient between them is computed using the following formula:

SC1(S, Q) = (S · Q) / (|S| × |Q|)

where S and Q are the vector representations of the sentence and the query.

2.2 Jaccard Coefficient
The Jaccard coefficient measures the similarity between finite sample sets. It is defined as the cardinality of the intersection of the sets divided by the cardinality of the union of the sample sets [3]. For text similarity, the Jaccard coefficient compares the sum of the weights of shared terms to the sum of the weights of terms that are present in either document but are not shared. The formal definition is:

SC2(S, Q) = (S · Q) / (|S|^2 + |Q|^2 - S · Q)

where S and Q are the vector representations of the sentence and the query.

2.3 TF-IDF Similarity
TF-IDF measures are a broad class of functions used to compute the similarity and relevance between queries and documents. The basic idea is that the more frequently a word appears in a text, the more indicative that word is of the topicality of the text; and the less frequently a word appears across a document collection, the greater its power to discriminate between relevant and irrelevant text. The similarity function is:

SC(S, Q) = Σ_{w ∈ Q ∩ S} log(tf_{w,Q} + 1) · log(tf_{w,S} + 1) · log((N + 1) / (df_w + 0.5))

where tf_{w,Q} is the number of times word w appears in the query Q; tf_{w,S} is the number of times word w appears in the sentence S; N is the total number of sentences in the collection; and df_w is the number of sentences in which w appears.
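As a concrete illustration, the three syntactic measures above can be sketched in plain Python over bag-of-words term vectors. This is a toy sketch under the formulas given here, not the paper's implementation, which is not published:

```python
from collections import Counter
from math import log, sqrt

def term_vector(text):
    # Bag-of-words term-frequency vector for a lowercased text.
    return Counter(text.lower().split())

def dot(a, b):
    # Dot product over the shared vocabulary of two term vectors.
    return sum(a[w] * b[w] for w in a.keys() & b.keys())

def cosine(sentence, query):
    # SC1(S, Q) = (S . Q) / (|S| x |Q|)
    a, b = term_vector(sentence), term_vector(query)
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

def jaccard(sentence, query):
    # SC2(S, Q) = (S . Q) / (|S|^2 + |Q|^2 - S . Q)
    a, b = term_vector(sentence), term_vector(query)
    shared = dot(a, b)
    return shared / (dot(a, a) + dot(b, b) - shared)

def tfidf_score(sentence, query, collection):
    # SC(S, Q) = sum over shared words w of
    #   log(tf_wQ + 1) * log(tf_wS + 1) * log((N + 1) / (df_w + 0.5))
    a, b = term_vector(sentence), term_vector(query)
    n = len(collection)
    score = 0.0
    for w in a.keys() & b.keys():
        df = sum(1 for d in collection if w in term_vector(d))
        score += log(b[w] + 1) * log(a[w] + 1) * log((n + 1) / (df + 0.5))
    return score
```

For example, `cosine("sugar causes diabetes", "does sugar cause diabetes")` scores the overlap of the two word sets, while `tfidf_score` additionally down-weights words that occur in many sentences of the collection.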
2.4 Semantic Similarity
Semantic similarity measures text similarity derived from the semantic and syntactic information contained in the given texts. To compute it, a raw semantic vector is derived for each sentence with the help of a lexical database, and a word order vector is also formed for each sentence using the same lexical information. Each word in a sentence contributes differently to the complete meaning of the whole sentence, so the importance of a word is weighted using information content derived from a corpus; by combining the raw semantic vector with this corpus information, a semantic vector is obtained for each of the two texts [3]. Semantic similarity is then computed from the two semantic vectors, and an order similarity is computed from the two order vectors. Finally, the overall similarity is derived by combining the order similarity and the semantic similarity.

To find the relevance of a sentence to the given query, the values of the different similarity measures were averaged. A threshold was set so that sentences whose average falls above the threshold are declared relevant, and those falling below it are declared irrelevant.
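The averaging-and-threshold decision can be sketched as follows. The threshold value 0.3 is illustrative only; the paper does not state the value it used:

```python
def is_relevant(scores, threshold=0.3):
    # Average the individual similarity scores (e.g. cosine, Jaccard,
    # TF-IDF, semantic) and declare the sentence relevant when the
    # average clears the threshold. The 0.3 cut-off is an assumption
    # for illustration; the paper does not publish its value.
    overall = sum(scores) / len(scores)
    return overall >= threshold
```

A sentence scoring, say, [0.62, 0.41, 0.35, 0.50] across the four measures averages 0.47 and would be kept as relevant under this illustrative cut-off.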
3. Task 2 Support Calculation
With the explosive growth of social media, platforms such as Twitter, Facebook, blogs and microblogs are used to search for and extract information that helps in decision making.
Because a lot of information is available from various sources and is diverse in nature, it becomes very difficult to identify whether a piece of information supports, opposes, or is neutral to the user query.

The support of a sentence towards the query was identified using a special class of support vector machine that uses a "C factor". A support vector machine is basically a binary classifier, but to obtain the class "neutral" we used a special category of support vector machine, C-Support Vector Classification, which is based on libsvm [7].

In this approach the first step is to convert the collection of sentences in a document into Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors. The next step is to decide whether a feature should be made of word n-grams or character n-grams; the lower and upper boundaries of the range of n-values are set so that n-grams are extracted for all values of n with min_n <= n <= max_n. After this step, a vocabulary is built that considers the top max_features terms ordered by term frequency across the corpus. A learning method is then applied to learn the vocabulary and the IDF weights and to return the term-document matrix.
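The vectorization step described above can be sketched in plain Python. The parameter names `min_n`, `max_n` and `max_features` follow the text; the actual implementation (presumably a standard TF-IDF vectorizer feeding the libsvm-based classifier) is not published, so this is a minimal sketch of the described behaviour for word n-grams:

```python
from collections import Counter
from math import log

def ngrams(tokens, min_n, max_n):
    # Word n-grams for every n in the range min_n <= n <= max_n.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def fit_tfidf(sentences, min_n=1, max_n=2, max_features=1000):
    # Build the vocabulary as the top max_features n-grams ordered by
    # term frequency across the corpus, learn IDF weights, and return
    # the term-document (TF-IDF) matrix, mirroring the steps above.
    tf_per_doc = [Counter(ngrams(s.lower().split(), min_n, max_n))
                  for s in sentences]
    corpus_tf = Counter()
    for tf in tf_per_doc:
        corpus_tf.update(tf)
    vocab = [term for term, _ in corpus_tf.most_common(max_features)]
    n_docs = len(sentences)
    idf = {t: log(n_docs / sum(1 for tf in tf_per_doc if t in tf)) + 1
           for t in vocab}
    matrix = [[tf[t] * idf[t] for t in vocab] for tf in tf_per_doc]
    return vocab, matrix
```

The resulting term-document matrix is the feature representation on which the C-Support Vector Classification stage is then trained.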
We used the parameter C, the penalty parameter of the error term, together with a Radial Basis Function (RBF) kernel. When training an SVM with an RBF kernel, the parameter C is required: a lower value of C makes the decision boundary smooth, while a higher value of C aims at classifying all the training examples correctly. The C-SVM method provides a grid search that implements fit and score methods and supports functions such as probability prediction, decision functions, and transformation with its inverse.

C-SVM is a supervised learning technique in which a model is built to learn from the training data and to predict the class label of unseen data. The support vectors construct a hyperplane, or a set of hyperplanes, in a high-dimensional space, which is used for the identification of support.

4. Implementation Model
For the CHIS task, the organizing committee provided a training dataset of five documents, each with approximately 300 sentences, together with queries for the retrieval process. The training dataset consists of three attributes: attribute 1 is the sentence, attribute 2 states whether the sentence is relevant or irrelevant, and attribute 3 gives the polarity of the sentence towards the query as oppose, support, or neutral.

The steps followed to implement CHIS Task 1 are as follows. The user query, the training dataset, and the test dataset are taken as inputs, and pre-processing is applied to them. The pre-processing steps include stop word elimination, and all letters in the query and the sentences are converted to lowercase before the actual CHIS tasks are performed.

The relevance identification task is to find whether a given sentence is relevant or irrelevant to the query. To achieve this, the similarity between the given query and each sentence of the document collection is computed with each of the techniques shown in the diagram, namely cosine similarity, the Jaccard coefficient, TF-IDF similarity, and semantic similarity. The overall similarity is taken as the average of all these similarity measures. After the overall similarity between each pair of query and sentence has been computed, a sentence whose similarity exceeds the threshold value is considered relevant; otherwise it is considered irrelevant.

After evaluating the similarity between the query and each sentence in the document collection, the training dataset is used to train the C-Support Vector Machine (SVM) classifier, which predicts the class label for the test data. A normal SVM classifier separates only positive and negative classes, but in order to identify the neutral nature of a given sentence we used the special "C factor" in the SVM, which identifies marginal values between the upper and lower planes, using the TF-IDF features as the measure.

5. Conclusion
Consumer Health Information Search comprises two tasks. Task 1 is the identification of the relevance of a sentence to the query. Task 2 is the identification of the support (positive/negative/neutral) of the sentence with respect to the query. A framework has been designed to achieve these tasks. For Task 1 we computed several similarity measures to find the syntactic and semantic similarity between the sentence and the query. Task 2 is the calculation of the support of a sentence towards the given query; to achieve it, a special type of C-support vector classification is used, which works on TF-IDF features and incorporates the n-gram approach to learn the vocabulary. Using these feature vectors, the training dataset is used to learn the model, which is then applied to the test data to find the support. The results obtained from the CHIS organizers showed that the method we adopted for finding the relevance factor was not effective when compared with the other models submitted for that task. In the results obtained for Task 2, our model was found to work better and to compute the support effectively; it stood first when compared with the other models developed for this task.

6. REFERENCES
[1] Liu B. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies. 2012 May 22;5(1):1-67.
[2] Pang B, Lee L. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval. 2008 Jan 1;2(1-2):1-35.
[3] Li Y, McLean D, Bandar ZA, O'Shea JD, Crockett K. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering. 2006 Aug;18(8):1138-50.
[4] Metzler D, Bernstein Y, Croft WB, Moffat A, Zobel J. Similarity measures for tracking information flow. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management 2005 Oct 31 (pp. 517-524). ACM.
[5] Grossman DA, Frieder O. Information Retrieval: Algorithms and Heuristics. Springer Science & Business Media; 2012 Nov 12.
[6] Huang A. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand 2008 Apr 14 (pp. 49-56).
[7] Meyer D, Wien FT. Support vector machines: the interface to libsvm in package e1071. 2015 Aug 5.