DCU@INEX-2012: Exploring Sentence Retrieval for Tweet Contextualization

Debasis Ganguly, Johannes Leveling, and Gareth J. F. Jones

CNGL, School of Computing, Dublin City University, Dublin 9, Ireland
{dganguly, jleveling, gjones}@computing.dcu.ie

Abstract. For the participation of Dublin City University (DCU) in the INEX-2012 tweet contextualization task, we investigated sentence retrieval methodologies. The task requires providing the context to an ad-hoc real-life tweet, where this context is to be constructed from Wikipedia articles. Our approach involves indexing the passages in Wikipedia articles as separate retrievable units, extracting sentences from the top-ranked passages, computing a sentence selection score for each such sentence with respect to the query, and then returning the most similar ones. This simple sentence selection strategy performed quite well in the task: our best run ranked first in terms of readability and eighth in terms of informativeness out of 33 official runs.

1 Introduction

The tweet contextualization task was first introduced at INEX in 2011. The task requires the construction of a short summary that explains the context associated with a given tweet. This context information has to be constructed from Wikipedia articles. As an example, for the CNN tweet "RT @CNNLive: View stake-out camera at funeral home where #WhitneyHouston body is expected to arrive in New Jersey. Watch live: http://t.co/nyqT4PUa", the system is expected to provide such expository information as who Whitney Houston is, what she is famous for, how she died, etc.

Being different from standard ad-hoc IR, the task poses its own set of challenges. Firstly, the tweet text is very different from the keyword-based queries of ad-hoc or web search. This necessitates applying pre-processing steps to the tweet text to obtain an appropriate query string. For example, tweet hash-tags do not exist in Wikipedia articles and need to be appropriately processed to yield useful query terms. Secondly, standard passage retrieval may not be suitable for the task because of the restriction on the length of the reported summary: the text of a single passage may itself surpass the length threshold of the summary. It thus makes sense to decompose passages into smaller units, i.e. sentences, and collate these together.

Previous approaches to INEX-QA have mostly used passage retrieval coupled with a summarizer. Sentence retrieval, on the other hand, has been widely employed in TREC-QA tasks for both factoid and definition question answering [1]. Sentence retrieval has the potential to perform well for tweet contextualization because sentences, being short, contain more focused information than relatively larger passages, which may contain digressive content. Furthermore, the effect of sentence retrieval on tweet contextualization is still unexplored. This motivated us to apply various sentence retrieval strategies to the tweet contextualization task.

2 System Description

In this section, we describe the details of our system. After describing document and query processing, and retrieval, we focus on our methodologies for sentence selection.

2.1 Document Indexing

We used a modified version of the SMART system (ftp://ftp.cs.cornell.edu/pub/smart/) for the experiments at INEX 2012. Each paragraph from the Wikipedia corpus (http://dev.termwatch.es/esj/Term2IR/2012/data/tweetcontext2012corpus.xml.gz) was indexed as a retrievable document unit; the beginning of a passage is marked by the XML tag <p>. This resulted in a total of over 26M passages to retrieve from. Extracted portions of documents, namely the text under the <p>, <h>, and <t> tags, were indexed using single terms and a controlled vocabulary (or pre-defined set) of statistical phrases, following Salton's blueprint for automatic indexing [2]. Stop-words occurring in the standard stop-word list included within SMART were removed. Words were stemmed using a variation of the Lovins stemmer implemented within SMART. Frequently occurring word bi-grams (loosely referred to as phrases) were also used as indexing units. We used the N-gram Statistics Package (NSP, http://www.d.umn.edu/~tpederse/nsp.html) on the English Wikipedia text corpus from INEX 2006 and selected the 100,000 most frequent word bi-grams as the list of candidate phrases.
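To make the phrase selection step concrete, the fragment below is a minimal Python sketch, not the NSP package itself: it counts word bi-grams over a pre-tokenised corpus and keeps the most frequent ones as candidate indexing phrases. The tokenisation, function name, and toy data are illustrative assumptions.

```python
from collections import Counter
from itertools import islice

def top_bigram_phrases(tokenised_docs, n_phrases=100000):
    """Count word bi-grams over a tokenised corpus and return the
    n_phrases most frequent ones as candidate indexing phrases."""
    counts = Counter()
    for tokens in tokenised_docs:        # each item: a list of word tokens
        counts.update(zip(tokens, islice(tokens, 1, None)))
    return [bigram for bigram, _ in counts.most_common(n_phrases)]

# Toy usage; in our setting the corpus was the INEX 2006 English Wikipedia text.
docs = [["tweet", "contextualization", "at", "inex"],
        ["sentence", "retrieval", "for", "tweet", "contextualization"]]
print(top_bigram_phrases(docs, n_phrases=3))
```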

2.2 Query Processing

The tweet texts were pre-processed to produce queries to run against the indexed collection. The pre-processing steps are as follows. URLs were removed from the tweets using a regular-expression-based pattern matcher. Medial capital words, i.e. words with inner upper-case letters, were split into separate words, e.g. "WhitneyHouston" was decomposed into "Whitney" and "Houston". Tweet hash-tags were split into the prefix # character followed by the word, e.g. "#Whitney" was decomposed into # and "Whitney". The following word-breaking rules were applied to split hash-tags starting with "#" and usernames starting with "@". A break between the last and the current character is inserted if:

i) the last character is lower case and the current character is upper case or a digit (e.g. "OccupyWallStreet" -> "Occupy Wall Street");
ii) the last character is upper case and is the last character of a valid acronym, and the current character is also upper case or a digit (e.g. "CNNNews" -> "CNN News");
iii) the last character and the current character have different case and the resulting word would be longer than 3 characters.
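The sketch below illustrates a simplified version of this pre-processing in Python. The regular expressions only approximate rules i) and ii), and the function names and acronym handling are our own illustrative assumptions rather than the exact implementation used in our system.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def split_mixed_case(token):
    """Split on case/digit boundaries, e.g. 'OccupyWallStreet' -> 'Occupy Wall Street'
    (rule i) and 'CNNNews' -> 'CNN News' (an approximation of rule ii)."""
    token = re.sub(r"(?<=[a-z])(?=[A-Z0-9])", " ", token)    # rule i
    token = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", token)  # approximates rule ii
    return token

def tweet_to_query(tweet):
    """Turn a raw tweet into a query string for passage retrieval."""
    text = URL_RE.sub(" ", tweet)                  # remove URLs
    words = []
    for tok in text.split():
        if tok.startswith(("#", "@")):             # strip the hash-tag / username marker
            tok = tok[1:]
        words.append(split_mixed_case(tok))
    return " ".join(words)

print(tweet_to_query("RT @CNNLive: View #WhitneyHouston http://t.co/nyqT4PUa"))
# -> "RT CNN Live: View Whitney Houston"
```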
2.3 Retrieval

The context for each tweet was constructed in two passes. In the first pass, we retrieved N passages using language modelling (LM) [3] similarity with Jelinek-Mercer smoothing; the smoothing parameter λ was set to 0.6. In the second pass, we score the sentences of the retrieved passages using one of three methodologies, explained in detail below. We then concatenate the top M sentences until the length of the concatenated summary string exceeds the 500-character limit. The concatenation step ensures that we do not add duplicate sentences to the summary.

2.4 Sentence Retrieval Methodologies

Language Modelling Similarity. The simplest sentence scoring technique is to score a sentence S by its LM score computed with respect to the query, i.e. the pre-processed tweet text, as shown in Equation 1.

    P(S|Q) \propto \prod_{q \in Q} \big( \mu\, P(q|S) + (1 - \mu)\, P(q) \big)    (1)

Note that the smoothing parameter µ used in Equation 1 is different from the λ used for retrieving the passages, as discussed in Section 2.3.

Relevance Model Similarity. The second sentence selection strategy we use is derived from relevance model (RLM) term scores [4]. The key idea in RLM-based retrieval is that relevant documents and query terms are assumed to be sampled from an underlying hypothetical model of relevance R pertaining to the information need expressed in the query. In the absence of training data for the relevant set of documents, the only observable variables are the query terms and the top-ranked R pseudo-relevant documents assumed to be generated from the relevance model. Thus, the estimation of the probability of a word w being generated from the relevance model is approximated by the conditional probability of observing w given the observed query terms. The more a word w co-occurs with a query term q, the higher the likelihood of w being sampled from the relevance model, i.e. the higher is P(w|R). This is shown in Equation 2.

    P(w|q_i) \propto \sum_{j=1}^{R} P(w|D_j)\, P(q_i|D_j)    (2)

We can easily extend this notion of relevance model weighting of terms to whole sentences by aggregating over the constituent words of a sentence. This is shown in Equation 3, which we use to score every sentence, selecting the top-scoring ones for the returned summary.

    P(S|R) = \prod_{w \in S} P(w|R)    (3)

Topical Relevance Model Similarity. This sentence selection score is based on an extended version of the relevance model (RLM) similarity. In our extended relevance model, we compute the probabilities P(w|D) by marginalizing them over a set of latent topics. Firstly, we estimate the topic distribution over the set of top-ranked passages retrieved in the initial step by latent Dirichlet allocation (LDA) [5]. LDA outputs two distributions, θ (from document to topic) and φ (from topic to word). Modified smoothed document models are obtained from these two distributions as shown in Equation 4, where K is the number of topics used in the LDA estimation.

    P(w|D) = \sum_{k=1}^{K} P(w|z_k, \phi)\, P(z_k|D, \theta)    (4)

We then use the topic-smoothed document models in the estimation of the RLM, i.e. we use the definition of P(w|D) obtained from Equation 4 in Equation 2, which yields an extended RLM sentence selection methodology that we name topical relevance model (TRLM) similarity.

    P(w|q_i) \propto \sum_{j=1}^{R} \Big( \sum_{k=1}^{K} P(w|z_k, \phi)\, P(z_k|D_j, \theta) \Big)\, P(q_i|D_j)    (5)
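To make the scoring in Equations 2-5 concrete, the following Python fragment is a minimal sketch under the assumption that the smoothed document models P(w|D_j) and, for TRLM, the LDA distributions φ and θ have already been estimated; the data structures and function names are our own illustrative choices, not the SMART implementation.

```python
import math

def rlm_term_weight(w, query_terms, doc_models):
    """Equation 2: P(w|q) proportional to sum_j P(w|D_j) * P(q_i|D_j), aggregated
    over the query terms. Each element of doc_models maps a term to its smoothed
    probability under one of the top-R pseudo-relevant passages."""
    return sum(d.get(w, 1e-9) * d.get(q, 1e-9)
               for d in doc_models for q in query_terms)

def rlm_sentence_score(sentence_tokens, query_terms, doc_models):
    """Equation 3: product of the RLM weights of the sentence's terms,
    computed in log space to avoid underflow."""
    return sum(math.log(rlm_term_weight(w, query_terms, doc_models))
               for w in sentence_tokens)

def topic_smoothed_model(doc_index, phi, theta, vocabulary):
    """Equation 4: P(w|D) marginalised over K LDA topics, where phi[k][w] plays
    the role of P(w|z_k) and theta[doc_index][k] that of P(z_k|D). Feeding these
    models to rlm_term_weight gives the TRLM weighting of Equation 5."""
    K = len(phi)
    return {w: sum(phi[k].get(w, 0.0) * theta[doc_index][k] for k in range(K))
            for w in vocabulary}
```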
3 Run Description

We submitted three official runs (run ids: 185, 186 and 187) for the INEX-2012 Tweet Contextualization task. The first-pass passage retrieval for each of the three runs is identical and follows the description of Sections 2.1, 2.2 and 2.3. The sentence retrieval strategies of the runs differ: run 185 used simple language modelling (LM) similarity, run 186 used RLM similarity, and run 187 used TRLM similarity to score sentences. The number of top documents used for the (T)RLM estimation was set to 20. For TRLM, the additional parameter K, i.e. the number of topics, was set to 5. Our submissions did not use any automatic summarization techniques for sentence selection; we rather relied on pure IR-based approaches to generate the tweet contexts.

4 Evaluation

The tweet contexts were evaluated with two measures: a) informativeness, which measures the closeness of the answer string to a golden reference with the help of the KL divergence between the two; and b) readability, which measures the syntactic coherence of the text, e.g. whether it has grammatical errors, unresolved anaphora, or redundancy [6]. Table 1 reports the official results of our three submitted runs. Along with our runs, the table shows the official best run as measured by informativeness and the run submitted by the organizers as a baseline. Informativeness evaluation involves the computation of three metrics: the KL divergence between the golden summary and the returned summary for uni-grams, bi-grams, and bi-grams with two allowable gaps in between (skip-grams) [6]. Since KL divergence is a distance measure, a lower value of these metrics indicates a better result. The readability metrics, on the other hand, report the proportion of text which has correct syntax and structure and is relevant in the context; a higher value of these metrics therefore indicates a better result.

Table 1. Official results for the INEX-2012 Tweet Contextualization task

Informativeness metrics (lower is better)
Run Id  Run Description           Rank  Uni-gram  Bi-gram  Skip-gram
185     LM sentence retrieval        8    0.8265   0.9129     0.9135
186     RLM sentence retrieval      10    0.8347   0.9210     0.9208
187     TRLM sentence retrieval     11    0.8360   0.9235     0.9237
178     Official best                1    0.7734   0.8616     0.8623
194     Organizers' baseline         4    0.7864   0.8868     0.8887

Readability metrics (higher is better)
Run Id  Run Description           Rank  Relevance  Syntax  Structure
185     LM sentence retrieval        1     0.7728   0.7452    0.6446
186     RLM sentence retrieval       5     0.7008   0.6676    0.5636
187     TRLM sentence retrieval     14     0.6093   0.5252    0.4847
194     Organizers' baseline         4     0.6975   0.6342    0.5703

It can be seen that the simplest sentence retrieval technique, using LM similarity, performs fairly well as measured by informativeness, achieving rank eight. This run in fact achieves the best readability result. Our other runs, i.e. the (T)RLM-based sentence selection strategies, did not perform as well in the official evaluation. The release of the official relevance assessments, namely the reference summary context for each tweet, would enable us to tune the parameters of the two other sentence selection strategies in order to achieve improved performance.
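As a rough illustration of the flavour of the informativeness measure (this is not the official evaluation script), the sketch below compares the uni-gram distributions of a reference summary and a returned summary with a smoothed KL-divergence score; lower values indicate a closer match. The smoothing constant and example strings are arbitrary assumptions.

```python
import math
from collections import Counter

def unigram_dist(text):
    """Maximum-likelihood uni-gram distribution of a whitespace-tokenised text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def informativeness_divergence(reference, summary, epsilon=1e-6):
    """Smoothed KL-style divergence D(reference || summary) over uni-grams;
    a lower value means the returned summary is closer to the reference."""
    p = unigram_dist(reference)
    q = unigram_dist(summary)
    return sum(p_w * math.log(p_w / q.get(w, epsilon)) for w, p_w in p.items())

print(informativeness_divergence("whitney houston was an american singer and actress",
                                 "whitney houston was a famous singer"))
```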
5 Conclusions and Future Work

In our first participation in the INEX Tweet Contextualization task, we applied sentence retrieval to construct answer fragments for each tweet. Three different sentence selection methodologies were used: i) language modelling (LM) scoring; ii) relevance modelling (RLM) scoring of a sentence by accumulating the RLM scores of its constituent terms; and iii) topical relevance modelling (TRLM) scoring of a sentence by accumulating the topic-smoothed RLM scores of its constituent terms. The results confirm that simple IR-based sentence selection techniques can perform fairly well on both the informativeness and the readability metrics, without the application of any complex NLP techniques. The main advantage of the sentence retrieval methodologies is that they are very fast in contrast to computationally intensive NLP methods.

Acknowledgments

This research is supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) project.

References

1. Voorhees, E.M.: Overview of the TREC 2003 question answering track. In: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003) (2003) 54-68
2. Salton, G.: A Blueprint for Automatic Indexing. ACM SIGIR Forum 16(2) (Fall 1981) 22-38
3. Hiemstra, D.: Using Language Models for Information Retrieval. PhD thesis, Centre for Telematics and Information Technology, University of Twente, Enschede (2000)
4. Lavrenko, V., Croft, W.B.: Relevance based language models. In: Proceedings of SIGIR '01, ACM (2001) 120-127
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003) 993-1022
6. SanJuan, E., Moriceau, V., Tannier, X., Bellot, P., Mothe, J.: Overview of the INEX 2011 Question Answering Track (QA@INEX). In: Pre-proceedings of the INitiative for the Evaluation of XML retrieval workshop (INEX 2011), Saarbrücken, Germany (December 2011) 145-153