A Hybrid Tweet Contextualization System using IR and Summarization

Pinaki Bhaskar, Somnath Banerjee, and Sivaji Bandyopadhyay

Department of Computer Science and Engineering, Jadavpur University, Kolkata - 700032, India
{pinaki.bhaskar, s.banerjee1980}@gmail.com, sivaji_cse_ju@yahoo.com

Abstract. This article presents the experiments carried out as part of our participation in the Tweet Contextualization (TC) track of INEX 2012, for which we submitted three runs. The INEX TC task has two main sub-tasks: focused IR and automatic summarization. In the focused IR system, we first preprocess the Wikipedia documents and then index them using Nutch, with an additional named entity (NE) field. Stop words are removed from each query tweet, all NEs are tagged, and the remaining tweet words are stemmed using the Porter stemmer. The stemmed tweet words form the query for retrieving the most relevant document from the index. The automatic summarization system takes as input the query tweet along with the title of the most relevant retrieved document. The most relevant sentences are retrieved from that document based on the TF-IDF of the matching query tweet terms, NEs and title words, and each retrieved sentence is assigned a ranking score. The answer passage consists of the top-ranked sentences, with a limit of 500 words. The three runs differ in the way the relevant sentences are retrieved from the associated document.

Keywords: Information Retrieval, Automatic Summarization, Question Answering, Information Extraction, INEX 2012

1 Introduction

With the explosion of information on the Internet, natural language Question Answering (QA) is recognized as a capability with great potential. Traditionally, QA has attracted many AI researchers, but most QA systems developed so far have been toy systems or games confined to laboratories and very restricted domains. Several recent conferences and workshops have focused on aspects of QA research. Starting in 1999, the Text REtrieval Conference (TREC, http://trec.nist.gov/) has sponsored a question answering track, which evaluates systems that answer factual questions by consulting the documents of the TREC corpus. A number of systems in this evaluation have successfully combined information retrieval and natural language processing techniques. More recently, the Conference and Labs of the Evaluation Forum (CLEF, http://www.clef-initiative.eu/) has organized QA labs since 2010. INEX (https://inex.mmci.uni-saarland.de/) has also started a Question Answering track. INEX 2011 designed a QA track [1] to stimulate research for real-world applications. The QA task performed by the participating groups of INEX 2011 was contextualizing tweets, i.e., answering questions of the form "what is this tweet about?" using a recent cleaned dump of the Wikipedia (April 2011). This year the task has been renamed Tweet Contextualization. The INEX 2012 Tweet Contextualization (TC) track gives QA research a new direction by fusing IR and summarization with QA. The TC track of INEX 2012 has two major sub-tasks: the first is to identify the most relevant document from the Wikipedia dump, for which a focused IR system is needed; the second is to extract the most relevant passages from that retrieved document, for which an automatic summarization system is needed.
The general purpose of the task involves tweet analysis, passage and/or XML element retrieval, and construction of the answer, more specifically, a summary of the tweet topic. Automatic text summarization [2] has become an important and timely tool for assisting and interpreting text information in today's fast-growing information age. Text summarization methods can be classified into abstractive and extractive summarization. Abstractive summarization ([3] and [4]) attempts to develop an understanding of the main concepts in a document and then express those concepts in clear natural language. Extractive summaries [5] are formulated by extracting key text segments (sentences or passages) from the text, based on statistical analysis of individual or mixed surface-level features such as word/phrase frequency, location or cue words, to locate the sentences to be extracted. Our approach is based on extractive summarization.

In this paper, we describe a hybrid tweet contextualization system combining focused IR and automatic summarization for the TC track of INEX 2012. The focused IR system is based on the Nutch architecture, and the automatic summarization system is based on TF-IDF based sentence ranking and sentence extraction techniques. The same sentence scoring and ranking approach as in [6] and [7] has been followed. We submitted three runs to the track (177, 191 and 192).

2 Related Works

Recent trends show that a hybrid approach to tweet contextualization using Information Retrieval (IR) can improve the performance of a TC system. Reference [8] removed incorrect answers of a QA system using an IR engine. Reference [9] successfully used IR methods in a QA system. Reference [10] used an IR system in QA, and [11] proposed an efficient hybrid QA system using IR in QA. Reference [12] presents an investigation into the utility of document summarization in the context of IR, more specifically in the application of so-called query-biased summaries: summaries customized to reflect the information need expressed in a query. Employed in the retrieved document list displayed after retrieval took place, the summaries' utility was evaluated in a task-based environment by measuring users' speed and accuracy in identifying relevant documents. This was compared to the performance achieved when users were presented with the more typical output of an IR system: a static predefined summary composed of the title and first few sentences of retrieved documents. The results from the evaluation indicate that the use of query-biased summaries significantly improves both the accuracy and speed of user relevance judgments.

A lot of research work has been done in the domain of both query-dependent and query-independent summarization. MEAD [13] is a centroid-based multi-document summarizer, which generates summaries using cluster centroids produced by a topic detection and tracking system. NeATS [14] selects important content using sentence position, term frequency, topic signature and term clustering. XDoX [15] identifies the most salient themes within a document set by passage clustering and then composes an extraction summary reflecting these main themes. Graph-based methods have also been proposed for generating summaries. A document graph based query focused multi-document summarization system has been described by [16], [6] and [7].
In the present work, we have used the IR system described in [10], [11] and [17] and the automatic summarization system discussed in [6], [7] and [17]. In the remainder of this paper, Section 3 describes the corpus statistics and Section 4 shows the architecture of the combined TC system of focused IR and automatic summarization for INEX 2012. Section 5 details the focused information retrieval system and Section 6 the automatic summarization system. The evaluations carried out on the submitted runs are discussed in Section 7 along with the evaluation results. Conclusions are drawn in Section 8.

3 Corpus Statistics

The training data is a collection of documents rebuilt from a recent English Wikipedia dump (November 2011). All notes and bibliographic references have been removed from the Wikipedia pages to prepare a plain XML corpus allowing easy extraction of plain text answers. Each training document consists of a title, an abstract and sections, and each section has a sub-title. The abstract and the sections are made of paragraphs, and each paragraph can contain entities that refer to other Wikipedia pages. The resulting corpus therefore has the simple DTD shown in Table 1.

Table 1. The DTD for Wikipedia pages

The test data consists of 1142 tweets from Twitter. The tweets come in two different formats: the full JSON format with all tweet metadata, as shown in Table 2, and a two-column text format with only the tweet id and tweet text, as shown in Table 3.

Table 2. The full JSON format with all tweet metadata of the INEX 2012 test corpus

    {
      "created_at":"Fri, 03 Feb 2012 09:10:20 +0000",
      "from_user":"XXX",
      "from_user_id":XXX,
      "from_user_id_str":"XXX",
      "from_user_name":"XXX",
      "geo":null,
      "id":XXX,
      "id_str":"XXX",
      "iso_language_code":"en",
      "metadata":{"result_type":"recent"},
      "profile_image_url":"http://XXX",
      "profile_image_url_https":"https://XXX",
      "source":"",
      "text":"blahblahblah",
      "to_user":null,
      "to_user_id":null,
      "to_user_id_str":null,
      "to_user_name":null
    }

Table 3. The two-column text format with only tweet id and tweet text of the INEX 2012 test corpus

    Tweet Id             Tweet Text
    170167036520038400   "What links human rights, biodiversity and habitat loss, deforestation, pollution, pesticides, Rio +20 and and a sustainable future for all?"
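To make the two input formats concrete, the following minimal Python sketch (ours, not part of the submitted system) reads both. The assumptions that the JSON file stores one tweet object per line and that the two-column file is tab-separated, as well as the function names, are ours.

```python
import json

def read_two_column(path):
    """Read the two-column format of Table 3; assumes tab-separated lines."""
    tweets = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet_id, _, text = line.rstrip("\n").partition("\t")
            if tweet_id:
                tweets[tweet_id] = text.strip().strip('"')
    return tweets

def read_json_tweets(path):
    """Read the JSON format of Table 2; assumes one tweet object per line."""
    tweets = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            tweets[obj["id_str"]] = obj["text"]
    return tweets
```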
4 System Architecture

This section gives an overview of the framework of the current INEX system, which has two major sub-systems: the focused IR system and the automatic summarization system. The focused IR system has been developed on the basic architecture of Nutch (http://nutch.apache.org/), which builds on Lucene (http://lucene.apache.org/). Nutch is an open source search engine; the present system uses it for monolingual information retrieval in English. The higher-level architecture of the combined tweet contextualization system of focused IR and automatic summarization is shown in Figure 1.

Fig. 1. Higher-level system architecture of the current INEX system

5 Focused Information Retrieval (IR)

5.1 Wikipedia Document Parsing and Indexing

Web documents are full of noise mixed with the original content, which makes it difficult to identify and separate the noise from the actual content. The INEX 2012 corpus, i.e., the Wikipedia dump, contains some noise, and the documents are in XML-tagged format. So, first of all, the documents had to be preprocessed: the document structure is checked and reformatted according to the system requirements.

XML Parser. The corpus was in XML format. All the XML test data was parsed before indexing using our XML parser, which extracts the title of each document along with its paragraphs.

Noise Removal. The corpus contains some noise as well as some special symbols that are unnecessary for our system. The list of noise and special symbols was initially developed manually by inspecting a number of documents; it is then used to automatically remove such symbols from the documents. Some examples are “"”, “&”, “'''” and multiple spaces.

Named Entity Recognizer (NER). After cleaning the corpus, the named entity recognizer identifies all the named entities (NEs) in the documents and tags them according to their types; these tags are indexed during document indexing.

Document Indexing. After parsing, the Wikipedia documents are indexed using Lucene, an open source indexer.

5.2 Tweet Parsing

After indexing, the tweets are processed to retrieve relevant documents. Each tweet/topic is processed to identify the query words for submission to Lucene. The tweet processing steps are described below; a small sketch of the resulting query formation and retrieval follows at the end of Section 5.3.

Stop Word Removal. In this step the tweet words are identified. The stop words and question words (what, when, where, which etc.) are removed from each tweet, and the words remaining after this removal are taken as the query tokens. The stop word list used in the present work can be found at http://members.unine.ch/jacques.savoy/clef/.

Named Entity Recognizer (NER). After removing the stop words, the named entity recognizer identifies all the named entities (NEs) in the tweet and tags them according to their types; these tags are used during the scoring of the sentences of the retrieved document.

Stemming. Query tokens may appear in inflected forms in the tweets. For English, the standard Porter stemming algorithm (http://tartarus.org/~martin/PorterStemmer/java.txt) has been used to stem the query tokens. After stemming, the queries are formed from the stemmed query tokens.

5.3 Document Retrieval

Searching each query against the Lucene index returns a set of retrieved documents in ranked order. First, all queries are fired with the AND operator. If at least one document is retrieved for a query, that query is removed from the query list and is not searched again. The remaining queries are fired again with the OR operator; OR searching retrieves at least one document for each query. The top-ranked relevant document for each query is then considered for passage selection. Document retrieval is the most crucial part of this system: we take only the top-ranked document, assuming that it is the most relevant one for the query, or equivalently for the tweet from which the query was generated.
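The following Python sketch illustrates the query formation and the AND-then-OR fallback described above. It is illustrative only: it uses NLTK's PorterStemmer in place of the system's Java implementation, the stop word set is an abbreviated stand-in for the Savoy list cited above, and `search` stands in for a Lucene index lookup returning a ranked document list.

```python
from nltk.stem import PorterStemmer

# Abbreviated stand-in for the Savoy stop word list cited above.
STOP_WORDS = {"a", "an", "the", "is", "of", "what", "when", "where", "which"}

def build_query_tokens(tweet_text):
    """Strip punctuation and stop/question words, then Porter-stem the rest."""
    stemmer = PorterStemmer()
    tokens = (w.lower().strip('.,!?:;"\'') for w in tweet_text.split())
    return [stemmer.stem(t) for t in tokens if t and t not in STOP_WORDS]

def retrieve_top_document(tokens, search):
    """search(query) -> ranked document list; a stand-in for the Lucene index.
    Fire the query with AND first; fall back to OR only if nothing matches."""
    hits = search(" AND ".join(tokens))
    if not hits:
        hits = search(" OR ".join(tokens))
    return hits[0] if hits else None   # only the top-ranked document is kept
```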
6 Automatic Summarization

6.1 Sentence Extraction

The document text is parsed, and the parsed text is used to generate the summary. This module takes the parsed text of the documents as input, filters it and extracts all the sentences from it. It therefore has two sub-modules: text filterization and sentence extraction.

Text Filterization. The parsed text may contain junk or unrecognized characters and symbols. First, these characters and symbols are identified and removed. The text in the query language is identified and extracted from the document using the Unicode character list collected from Wikipedia (http://en.wikipedia.org/wiki/List_of_Unicode_characters). Symbols like dot (.), comma (,), single quote (‘), double quote (“), ‘!’, ‘?’ etc. are common to all languages, so these are also listed as symbols.

Sentence Extraction. In this module, the filtered parsed text is parsed to identify and extract all the sentences in the documents. Sentence identification and extraction is not an easy task even for English documents: the sentence marker ‘.’ (dot) is not only used as a sentence marker, but also as a decimal point and in abbreviations like Mr., Prof., U.S.A. etc., which creates a lot of ambiguity. A list of possible abbreviations had to be created to minimize this ambiguity. Moreover, the closing quotation mark (”) is usually placed after the sentence-final dot, as in .”. These kinds of ambiguities are identified and resolved to extract all the sentences from the document; a sketch of such a splitter is given below.
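The sketch below shows one way to implement the abbreviation-aware splitting just described. It is a minimal illustration, and the abbreviation set is a hypothetical fragment of the manually built list mentioned above.

```python
import re

# Hypothetical fragment of the manually built abbreviation list.
ABBREVIATIONS = {"Mr.", "Mrs.", "Prof.", "Dr.", "U.S.A.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on '.', '!' or '?' (optionally followed by a closing quote),
    merging back splits caused by known abbreviations."""
    # Boundary: a terminator or closing quote, then whitespace, then a capital
    # or an opening quote; the lookbehind is fixed-width to satisfy Python's re.
    chunks = re.split(r'(?<=[.!?\u201d"])\s+(?=[\u201c"A-Z])', text)
    sentences, buffer = [], ""
    for chunk in chunks:
        buffer = (buffer + " " + chunk).strip()
        last_token = buffer.split()[-1] if buffer.split() else ""
        if last_token in ABBREVIATIONS:
            continue               # the dot belonged to an abbreviation
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

# Example: the split after "Mr." is merged back into one sentence.
print(split_sentences('He met Mr. Smith. "It went well." Then he left.'))
```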
6.2 Key Term Extraction

The key term extraction module has two sub-modules: query/tweet term extraction and title word extraction. (Since the query is generated directly from the tweet, the query terms and the tweet text keywords coincide.) These sub-modules are described in the following sections.

Query/Tweet Term Extraction. First, the query generated from the tweet is parsed using the query parsing module, in which the named entities (NEs) are identified and tagged using the Stanford NER engine (http://www-nlp.stanford.edu/ner/).

Title Word Extraction. The title of the retrieved document is extracted and given as input to the title word extraction module. After removing all the stop words from the title, the remaining title words are extracted and used as keywords in this system.

6.3 Top Sentence Identification

All the extracted sentences are searched for the keywords, i.e., the query/tweet terms and the title words. The extracted sentences are weighted according to these matches and ranked on the basis of the calculated weight. This module therefore has two sub-modules, weight assigning and sentence ranking, which are described below.

Weight Assigning. This sub-module calculates the weight of each sentence in the document. The sentence weight has two basic components: a query/tweet term dependent score and a title word dependent score. These components are calculated and added to get the final weight of a sentence.

Query/Tweet term dependent score: the query/tweet term dependent score is the most important and relevant score for the summary, so it is given the maximum priority. It is calculated using equation 1:

    Q_s = \sum_{q=1}^{n_q} \left( F_q \left( 20 + (n_q - q + 1) \left( 1 - \frac{f_p^q - 1}{N_s} \right) \right) \right) \times p    (1)

where Q_s is the query/tweet term dependent score of sentence s, q is the index of the query/tweet term, n_q is the total number of query terms, f_p^q is the position of the word in sentence s that matched query term q, N_s is the total number of words in sentence s, and

    F_q = \begin{cases} 1 & \text{if query term } q \text{ is found} \\ 0 & \text{if query term } q \text{ is not found} \end{cases}    (2)

    p = \begin{cases} 5 & \text{if the query term is an NE} \\ 3 & \text{if the query term is not an NE} \end{cases}    (3)

At the end of equation 1, the calculated query term dependent score is multiplied by p to set the priority among the scores: if the query term is an NE and is contained in a sentence, the weight of the matched sentence is multiplied by 5 (p = 5) to give it the highest priority; otherwise it is multiplied by 3 (p = 3 for non-NE query terms).

Title word dependent score: title words are extracted from the title field of the top-ranked retrieved document, and a title word dependent score is also calculated for each sentence. Title words are generally among the most relevant words of the document, so a sentence containing title words is likely to be relevant to the main topic of the document. Title word dependent scores are calculated using equation 4:

    T_s = \sum_{t=1}^{n_t} F_t \left( n_t - t + 1 \right) \left( 1 - \frac{f_p^t - 1}{N_s} \right)    (4)

where T_s is the title word dependent score of sentence s, t is the index of the title word, n_t is the total number of title words, f_p^t is the position of the word in sentence s that matched title word t, N_s is the total number of words in sentence s, and

    F_t = \begin{cases} 1 & \text{if title word } t \text{ is found} \\ 0 & \text{if title word } t \text{ is not found} \end{cases}    (5)

After calculating the above scores, the final weight of each sentence is calculated by simply adding the two scores, as in equation 6:

    W_s = Q_s + T_s    (6)

where W_s is the final weight of sentence s.

Sentence Ranking. After the weights of all the sentences in the document have been calculated, the sentences are sorted in descending order of weight. If two or more sentences get equal weight, they are sorted in ascending order of their positional value, i.e., their sentence number in the document. The sentence ranking module thus provides the ranked sentences.

6.4 Summary Generation

This is the final and most critical module of the system: it generates the summary from the ranked sentences. As in [13], the module selects the ranked sentences subject to the maximum length of the summary, following equation 9:

    \sum_i l_i S_i < L    (9)

where l_i is the length (in words) of sentence i, S_i is a binary variable representing the selection of sentence i for the summary, and L (= 500 words) is the maximum length of the summary. The selected sentences, along with their weights, are then presented in the INEX output format. A compact sketch of this scoring and selection procedure is given below.
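As a concrete rendering of the scoring and selection just described, the following Python sketch (ours; tokenization and matching are simplified to exact lower-cased word matches, and all names are illustrative) implements equations 1 to 6 and the greedy selection of equation 9.

```python
def query_term_score(sentence_words, query_terms, ne_terms):
    """Q_s of equation 1: earlier query terms and earlier match positions
    contribute more; NE matches get priority p = 5, others p = 3."""
    n_q, n_s = len(query_terms), len(sentence_words)
    score = 0.0
    for q, term in enumerate(query_terms, start=1):
        if term in sentence_words:                  # F_q = 1
            f_p = sentence_words.index(term) + 1    # 1-based match position
            p = 5 if term in ne_terms else 3
            score += (20 + (n_q - q + 1) * (1 - (f_p - 1) / n_s)) * p
    return score

def title_word_score(sentence_words, title_words):
    """T_s of equation 4: same shape as Q_s, without the offset and priority."""
    n_t, n_s = len(title_words), len(sentence_words)
    score = 0.0
    for t, word in enumerate(title_words, start=1):
        if word in sentence_words:                  # F_t = 1
            f_p = sentence_words.index(word) + 1
            score += (n_t - t + 1) * (1 - (f_p - 1) / n_s)
    return score

def build_summary(sentences, query_terms, ne_terms, title_words, limit=500):
    """Rank by W_s = Q_s + T_s (equation 6), break ties by sentence position,
    then select greedily while the total stays within the word limit (eq. 9)."""
    ranked = sorted(
        ((query_term_score(s.lower().split(), query_terms, ne_terms)
          + title_word_score(s.lower().split(), title_words), pos, s)
         for pos, s in enumerate(sentences)),
        key=lambda x: (-x[0], x[1]))
    summary, length = [], 0
    for weight, pos, s in ranked:
        n_words = len(s.split())
        if length + n_words < limit:
            summary.append((s, weight))
            length += n_words
    return summary
```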
7 Evaluation

7.1 Informative Content Evaluation

The organizers performed the informative content evaluation [1] by selecting relevant passages. 50 topics were evaluated over a pool of 14,654 sentences (471,344 tokens, vocabulary of 59,020 words). Among them, 2,801 sentences (103,889 tokens, vocabulary of 19,037 words) are relevant. There are 8 topics with fewer than 500 relevant tokens. The information content divergences over {1, 2, 3, 4-gap}-grams of the FRESA package were not used because that measure was too sensitive to smoothing on the qa-rels; instead, the simple log difference of equation 10 was used:

    \log \left( \frac{\max \left( P(t \mid \text{reference}),\, P(t \mid \text{summary}) \right)}{\min \left( P(t \mid \text{reference}),\, P(t \mid \text{summary}) \right)} \right)    (10)

We submitted three runs (177, 191 and 192). The informativeness scores of our runs over all topics, together with the baseline scores computed by the organizers, are shown in Table 4.

Table 4. The informativeness scores by the organizers over all topics

    Run Id   Unigram   Bigram   Skip
    192      0.9590    0.9947   0.9947
    191      0.9590    0.9947   0.9947
    177      0.9541    0.9981   0.9984

7.2 Readability Evaluation

For the readability evaluation [1], all passages in a summary were assessed for Syntax (S), Anaphora (A), Redundancy (R) and Trash (T). If a passage contains a syntactic problem (bad segmentation, for example), it is marked with a Syntax (S) error. If a passage contains an unresolved anaphora, it is marked with an Anaphora (A) error. If a passage contains redundant information, i.e., information already given in a previous passage, it is marked with a Redundancy (R) error. If a passage does not make any sense in its context (i.e., after reading the previous passages), it is considered trash and marked Trash (T), and the readability of the following passages is assessed as if the trashed passage were not present. The readability evaluation scores are shown in Table 5.

Table 5. The evaluation scores of readability

    Run Id   Relevancy   Syntax   Structure   Nb
    192      0.6020      0.6020   0.6020      2
    191      0.6173      0.5540   0.5353      3
    177      0.5227      0.4680   0.4680      3
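As an illustration of equation 10, the following sketch (ours) computes the log-difference dissimilarity over unigrams. The add-one smoothing is our assumption; the official evaluation handled smoothing differently and also used bigrams and skip bigrams.

```python
import math
from collections import Counter

def log_dissimilarity(reference_tokens, summary_tokens):
    """Equation 10 over unigrams: sum of log(max(p_ref, p_sum) / min(...))
    across the joint vocabulary of reference and summary."""
    ref, summ = Counter(reference_tokens), Counter(summary_tokens)
    vocab = set(ref) | set(summ)
    n_ref = sum(ref.values()) + len(vocab)   # add-one smoothing (assumption)
    n_sum = sum(summ.values()) + len(vocab)
    score = 0.0
    for t in vocab:
        p_ref = (ref[t] + 1) / n_ref
        p_sum = (summ[t] + 1) / n_sum
        score += math.log(max(p_ref, p_sum) / min(p_ref, p_sum))
    return score
```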
8 Conclusion and Future Works

The tweet contextualization system was developed as part of our participation in the Tweet Contextualization track of the INEX 2012 evaluation campaign, and the overall system has been evaluated using the evaluation metrics provided as part of the track. Considering that this is only our second participation in the track, the evaluation results are satisfactory, which encourages us to continue working on the system and to participate in this track in future. Future work will focus on improving the performance of the system by concentrating on co-reference and anaphora resolution, multi-word identification, paraphrasing, feature selection etc. We will also try to use semantic similarity, which should increase our relevance score.

Acknowledgements. We acknowledge the support of the IFCPAR funded Indo-French project "An Advanced Platform for Question Answering Systems" and the DIT, Government of India funded project "Development of Cross Lingual Information Access (CLIA) System Phase II".

References

1. SanJuan, E., Moriceau, V., Tannier, X., Bellot, P., Mothe, J.: Overview of the INEX 2011 Question Answering Track (QA@INEX). In: Geva, S., Kamps, J., Schenkel, R. (eds.) Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2011). Lecture Notes in Computer Science. Springer (2011)
2. Jezek, K., Steinberger, J.: Automatic Text Summarization. In: Snasel, V. (ed.) Znalosti 2008, pp. 1--12. FIIT STU Bratislava, Ustav Informatiky a softveroveho inzinierstva (2008). ISBN 978-80-227-2827-0
3. Erkan, G., Radev, D.R.: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457--479 (2004)
4. Hahn, U., Romacker, M.: The SYNDIKATE Text Knowledge Base Generator. In: Proceedings of the First International Conference on Human Language Technology Research. Association for Computational Linguistics, Morristown, NJ, USA (2001)
5. Kyoomarsi, F., Khosravi, H., Eslami, E., Dehkordy, P.K.: Optimizing Text Summarization Based on Fuzzy Logic. In: Seventh IEEE/ACIS International Conference on Computer and Information Science, pp. 347--352. IEEE (2008)
6. Bhaskar, P., Bandyopadhyay, S.: A Query Focused Multi Document Automatic Summarization. In: Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC 24), Tohoku University, Sendai, Japan (2010)
7. Bhaskar, P., Bandyopadhyay, S.: A Query Focused Automatic Multi Document Summarizer. In: Proceedings of the International Conference on Natural Language Processing (ICON), pp. 241--250. IIT Kharagpur, India (2010)
8. Rodrigo, A., Iglesias, J.P., Peñas, A., Garrido, G., Araujo, L.: A Question Answering System based on Information Retrieval and Validation. ResPubliQA (2010)
9. Schiffman, B., McKeown, K.R., Grishman, R., Allan, J.: Question Answering using Integrated Information Retrieval and Information Extraction. In: NAACL HLT, pp. 532--539 (2007)
10. Pakray, P., Bhaskar, P., Pal, S., Das, D., Bandyopadhyay, S., Gelbukh, A.: JU_CSE_TE: System Description QA@CLEF 2010 -- ResPubliQA. In: Multiple Language Question Answering (MLQA 2010), CLEF 2010, Padua, Italy (2010)
11. Pakray, P., Bhaskar, P., Banerjee, S., Pal, B.C., Bandyopadhyay, S., Gelbukh, A.: A Hybrid Question Answering System based on Information Retrieval and Answer Validation. In: Question Answering for Machine Reading Evaluation (QA4MRE), CLEF 2011, Amsterdam (2011)
12. Tombros, A., Sanderson, M.: Advantages of Query Biased Summaries in Information Retrieval. In: SIGIR (1998)
13. Radev, D.R., Jing, H., Styś, M., Tam, D.: Centroid-based Summarization of Multiple Documents. Information Processing and Management 40, 919--938 (2004)
14. Lin, C.Y., Hovy, E.H.: From Single to Multi-document Summarization: A Prototype System and its Evaluation. In: ACL, pp. 457--464 (2002)
15. Hardy, H., Shimizu, N., Strzalkowski, T., Ting, L., Wise, G.B., Zhang, X.: Cross-document Summarization by Concept Classification. In: SIGIR, pp. 65--69 (2002)
16. Paladhi, S., Bandyopadhyay, S.: A Document Graph Based Query Focused Multi-Document Summarizer. In: The 2nd International Workshop on Cross Lingual Information Access (CLIA), pp. 55--62 (2008)
17. Bhaskar, P., Banerjee, S., Neogi, S., Bandyopadhyay, S.: A Hybrid QA System with Focused IR and Automatic Summarization for INEX 2011. In: Geva, S., Kamps, J., Schenkel, R. (eds.) Focused Retrieval of Content and Structure: 10th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2011. Lecture Notes in Computer Science, vol. 7424. Springer, Berlin, Heidelberg (2012)