=Paper=
{{Paper
|id=Vol-2036/T2-11
|storemode=property
|title=Use of the Pole-based Overlapping Clustering Method for Labeling Tweets @ IRMiDis FIRE 2017
|pdfUrl=https://ceur-ws.org/Vol-2036/T2-11.pdf
|volume=Vol-2036
|authors=Noushin Fadaei,Thomas Mandl
|dblpUrl=https://dblp.org/rec/conf/fire/FadaeiM17
}}
==Use of the Pole-based Overlapping Clustering Method for Labeling Tweets @ IRMiDis FIRE 2017==
Use of the Pole-based Overlapping Clustering Method for Labeling Tweets @ IRMiDis FIRE 2017 Noushin Fadaei Thomas Mandl Hildesheim University Hildesheim University Hildesheim, Germany Hildesheim, Germany fadaei@uni-hildesheim.de mandl@uni-hildesheim.de 1 INTRODUCTION Consequently all objects which are not part of the poles are assigned Real-time information in social media such as Twitter on major to the clusters based on their similarity with the poles. Based on events, for example major earthquakes or terrorist attacks, plays an those similarities a ranking of the poles is generated for each object important role in revealing critical needs resulted by the disaster and it is always assigned to the most similar one (rank 1). In order and the resources to address them; however this information is to validate the assignment of the clusters on the other ranking po- also influentially clouded by emotional posts. Finding the critical sitions the following condition is required: The average similarity information and finding their relevance to the resources would between the object and the previous as well as the following pole provide a valuable platform for further analysis. needs to be bigger or equal to the similarity of the object to the This working note covers the method we used to tackle the current pole. challenge exposed to discussion by FIRE 2017 track: Information Following the hierarchical agglomerative clustering method “single- Retrieval from Microblogs during Disasters (IRMiDis) [1]. The first link” [4], the hierarchical clusters are being built until all objects task is to identify tweets regarding the required resources such as fill in one single cluster. The algorithm is shown in figure 1. food, water or medical aids and the available or potential avail- able resources like the already transported or distributed resources among a dataset of over 50,000 tweets regarding the Nepal-India 2.2 Expert Knowledge as Relevance Feedback earthquake in 2015. PoBOC builds a graph on top of the similarity matrices of the given A tweet can represent the needs for resources as well as the objects and tries to find out the strongly connected graphs namely availability of some others at specific locations; therefore it would poles; however if we consider these poles are already identified by be reasonable to use a soft clustering algorithm which shows the the experts (similar to relevance feedback methods), our method relevance degree of one tweet to each of the classes. Pole-based tries to find the objects that are close enough to each pole. One overlapping clustering algorithm utilizes this notion to not only alternative solution is to cluster the entire data and consider the detect the proper class for each object but also let the objects share level of hierarchy that covers three cluster (as need class, availabil- various classes in case they are similar enough to each class. ity class and class of irrelevant tweets). Although PoBOC supports overlapping clusters, it doesn’t explain how to organize a soft clus- 2 METHODOLOGY tering over two poles. An object belongs to the closest pole. In case 2.1 Pole-based Overlapping Clustering an object is similar to both of the classes evenly, we considered (PoBOC) it a member of both. The similarity of each object and a pole is calculated as follows: The algorithm starts off with a similarity matrix X × X containing the similarity of each object with all the other objects. Based on this matrix a similarity graph is built, where an edge exists between two 1 Õ s(x i , Pm ) = s(x i , x j ) (1) objects, if the similarity of these two is greater than the average |Pm | x j ∈Pm similarities of each of the two individual objects and the whole set of objects. [2] In the next step the poles representing the individual medoids of where x i is an object and Pm is the mth pole. the clusters are built. For this purpose the cliques within the sim- ilarity graph are obtained. These cliques form subgraphs where We have used Weka [3] for the pre-processing. We removed the each node/vertex is linked to all other nodes/vertices, starting off links and used Snowball stop word list in addition to some few with a node with at least the degree of one, with the lowest average frequent words in the data collection namely earthquake, Nepal similarity to all other objects. The starting nodes for the next poles and death. As tweets are quite short the minimum term frequency are determined in order to minimize the similarity with previous of a word is not of a matter and for weighting tfidf is used to select poles. The termination criterion for this process is again based on the first 500 important words. These terms formed Weka instances the average similarity: if the average similarity of the starting node which are taken as objects in PoBOC algorithm. for the new pole with the already built poles is smaller than the average similarity of that node with all other objects, then the pro- Developing this fuzzy algorithm, Euclidean distance is used as a cess is ended and the number of clusters based on the previously similarity measure and the closer an object is to a pole, the higher found poles is determined. [2] it is ranked. during Disasters (IRMiDis). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org. [2] Guillaume Cleuziou, Lionel Martin, and Christel Vrain. 2004. PoBOC: an Overlap- ping Clustering Algorithm. Application to Rule-Based Classification and Textual Data. In In R. Lopez de Mantaras and L. Saitta, IOS Press, editor, Proceedings of the 16th European Conference on Artificial Intelligence. 440–444. [3] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl. 11, 1 (Nov. 2009), 10–18. https://doi.org/10.1145/1656274.1656278 [4] A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (Sept. 1999), 264–323. https://doi.org/10.1145/331499.331504 Figure 1: PoBOC: an overlapping clustering algorithm [2] Table 1: Sub-task 1- Identifying need-tweets and availability- tweets Submission Detail Availability-Tweets Evaluation Need-Tweets Evaluation Run ID RunType Precision@100Recall@1000 MAP Precision@100Recall@1000 MAP Average MAP Iwist_task1_1Automatic 0.0300 0.0194 0.0165 0.0400 0.1639 0.0417 0.0291 Source: Results of our methodology by IRMiDis - FIRE2017 [1] 3 RESULTS The results of our method regarding task 1 are shown in Table 1. According to the evaluations of the organization committee, it ranked 13t h. The lack of lexical databases and semantic analysis in the pre-processing phase can explain the low performance of this method. Considering the amount of emotional stop words using sentiment analysis can also improve the representation of the vector objects. In our method, we took the given training data as the experts feedback for which there are 674 tweets relevant to the availability class and less than one third of it namely 201 tweets relevant to the need class. Here the presumption is that the relevant tweets form a cluster meaning they have more similar terms overlap, however the relatively low recalls show the distribution of the relevant tweets does not match our assumption. Yet the recall of the need class is much better than the one of the availability class, and since we have less relevant tweets for the need class, it suggests the variety of the wordings regarding each class if not the variety of experts perspectives. 4 FUTURE WORK Due to working on this task within a short time the semantic aspects of the training data and the co-appearance of the word were ignored. Also the relevance feedback methods can play an important role in case we look at the problem as an information retrieval task where training data can be used to produce relevant queries. REFERENCES [1] Moumita Basu, Saptarshi Ghosh, Kripabandhu Ghosh, and Monojit Choudhury. 2017. Overview of the FIRE 2017 track: Information Retrieval from Microblogs 2